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Abstract 



The major challenge in designing a discriminative learning algorithm for pre- 
dicting structured data is to address the computational issues arising from 
the exponential size of the output space. Existing algorithms make differ- 
ent assumptions to ensure efficient, polynomial time estimation of model 
parameters. For several combinatorial structures, including cycles, partially 
ordered sets, permutations and other graph classes, these assumptions do 
not hold. In this thesis, we address the problem of designing learning al- 
gorithms for predicting combinatorial structures by introducing two new 
assumptions: 

(i) The first assumption is that a particular counting problem can be 
solved efficiently. The consequence is a generalisation of the classical 
ridge regression for structured prediction. 

(ii) The second assumption is that a particular sampling problem can be 
solved efficiently. The consequence is a new technique for designing 
and analysing probabilistic structured prediction models. 

These results can be applied to solve several complex learning problems 
including but not limited to multi-label classification, multi-category hier- 
archical classification, and label ranking. 
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Notational Conventions 



We will follow the notational conventions given below wherever possible. 

• Calligraphic letters {A,B . . .) denote sets (or particular spaces): 

— X an instance space. 

— y a label set. 

— Ha Hilbert space. 

• Capital letters {A, B,...) denote matrices or subsets of some set. 

• Bold letters or numbers denote special matrices or vectors: 

— I the identity matrix, i.e., a diagonal matrix (of appropriate di- 
mension) with all components on the diagonal equal to 1. 

— the zero element of a vector space or a matrix with all compo- 
nents equal to 0. For the vector spaces the zero element is 
the vector (of appropriate dimension) with all components equal 
to 0. 

— 1 the matrix (in M**^"*) or the vector (in W^) with all elements 
equal to 1. 

• Lowercase letters (a, 6, . . .) denote vectors, numbers, elements of some 
set, or functions: 

— m the number of training instances. 

— X a single instance. 

— y a single label. 

• Lowercase Greek letters a,P,... denote real numbers. 

• Bold lowercase Greek letters a,/3,... denote vectors of real numbers. 

• Symbols: 

— A^B: if A then B. 

— B: Aif B. 



NOTATIONAL CONVENTIONS 



— A<^B: A a and only if B. 

— f : X ^ y denotes a function from X to y. 

— /(•): to clearly distinguish a function f{x) from a function value 
f{x), we use /(•) for the function and /(x) only for the value 
of the function /(•) applied to x. This is somewhat clearer than 
using / for the function as (out of context) / could be read as a 
number. 

— denotes the transpose of the matrix A. 

— I • I denotes the function returning the £i norm of a vector. 

— II • II denotes the function returning the £2 norm of a vector. 

— |n] denotes the set {1, . . . , n}. 

• Other notational conventions and exceptions: 

— Aij denotes the component in the i-th row and j'-th column of 
matrix A. 

— Ai. denotes the z-th row vector of matrix A. 

— A.j denotes the j-th column vector of matrix A. 

— A denotes the regular isation parameter. 

— Pa probability distribution. 

— M the set of all real numbers. 

— N the set of all natural numbers 1,2,3, 



Chapter 1 

Introduction 



The discipline of machine learning ( Mitchell . 20061 ) has come a long way 
since Frank Rosenblatt invented the perceptron in 1957. The perceptron 
is a linear classifier: it takes a sequence of features as its input, computes 
a linear combination of the features and a sequence of weights, feeds the 
result into a step function (also known as activation function), and out- 
puts a binary value (0 or 1). A non-linear classifier can be designed using 
multiple layers of computational units (neurons) with non-linear activation 
function such as the sigmoid function resulting in what is called a multi- 
layered perceptron (MLP) or a feed- forward network (see F igure II .ip . An 
MLP is a universal function approximator (jCvbenkol . \198^ ). i.e, any feed- 
forward network with a single hidden layer comprising a finite number of 
units can be used to approximate any function. The weights of an MLP are 
learned from a given set of trai ning examples using a procedure called back- 
propagation (iRumelhart et al. I [l986) which typically modifies the weights 
iteratively based on the error incurred on the current training example, with 
training examples being fed into the network in a sequential manner. We 
thus have a machinery that is able to learn from experience and can be used 
for prediction thereby mimicking human behaviour (to a certain extent). 

This thesis is concerned with structured prediction — the problem of 
predicting multiple outputs with complex internal structure and dependencies 
among them. One of the classical structured prediction problems is temporal 
pattern recognition with applications in speech and handwriting recognition. 
The MLP as described above can be used to approximate a multi-valued 
function using multiple units in the output layer. The downside, however, is 
that it cannot model dependencies between the outputs since there are no 
connections bet ween uni t s within a layer. Recurrent neural networks or feed- 
back networks (jJordanl . Il986l : lElmanl . Il990l : iHochreiter and Schmidhuben 
address this problem by introducing feedback connections between 



units (see Figure [TT^ . The feedback mechanism can be seen as introducing 
memory into the network which makes the network particularly suitable to 
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Figure 1.1: A multi-layered perceptron with input, output, and hidden lay- 
ers. 



solve temporal pattern recognition problems^. Despite the fact that artifi- 
cial neural networks can be used to handle t emporal s equen ces, statistical 
models like hidden Markov models (HMM) (|Rabinej . Il989l ) have enjoyed 
significant success in adoption (including commercial use) in speech recog- 
nition problems as they can be train ed efficiently using pro cedures like the 



expectation-maximisation algorithm (jPempster et al.l . 119771 ). An HMM is a 



dynamic Bayesian network that models the joint distribution of inputs and 
outputs by placing a Markovian assumption on the input sequence. 

As evidenced above, structured prediction and algorithms pertaining to 
it have existed since the mid-80s. Later, with the int r oduction of support 



vecto r machines (SVM) in the 90s (jBoser et al.l . Il992l : ICortes and Vapnikl . 
199i), there has been a flurry of activity in formulating and solving ev- 



ery conceivable machine learning problem using tools from convex optimi- 
sation. Unsurprisingly, this has also had an effect on research in struc- 
tured prediction resulting in several algorithmic d evelopments includiii g 
models/algorithms^ like conditi onal randoni fields ( Laffertv et al.l . boOll ). 
max-m argin Markov networks (Taskar et al. . 2003I . 2005 ). and structured 



SVMs (|Tsochantaridis et al.l . I2OO5I 1. The contributions of this thesis follow 



this line of research in the following sense: 

we address some of the limitations of recent structured prediction 



^Temporal structure in dat a can also be model ed using a feed- forward network known 
as time delay neural network fWaib el et al] . ll989l l. 

^Typically, a machine learning algorithm is a mechanism to estimate the parameters 
of an underlying model of the given data. We use the terms model and algorithm inter- 
changeably. 
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3 



Time 




Figure 1.2: A fully-connected recurrent neural network with two unit s, ui 
and U2 , and its equivalent feed- forward network (jRumelhart et alJ . 119871 ) . At 
every time step, a copy of the network is created. Thus the network unfolds 
in time and standard backpropagation can be used to train the network. As 
long as there is a constraint that the weights be the same for all the copies 
of the network over all the time steps, the behaviour of the feed- forward net 



work will be equivalent to that of the recurrent network (jRumelhart et al 



19871 ). The amount of contextual information used depends on the number 



of copies retained while training. 



algorithms when dealing with the specific problem of predicting 
combinatorial structures by proposing new techniques that will 
aid in the design and analysis of novel algorithms for structured 
prediction. 



1.1 Structured Prediction 

We begin with a non-technical introduction to structured prediction focusing 
particularly on applications. A formal algorithmic treatment is deferred 
until the next chapter. 

We restrict ourselves to supervised learning problems where training ex- 
amples are provided in (input, output) pairs. This is in contrast to other 
machine learning problems like density estimation and dimensionality reduc- 
tion that fall under the category of unsupervised learning. More formally, 
let X be the input space. For example, in OCR applications such as hand- 
writing recognition, this space could represent the set of all possible digits 
and letters including transformations like scaling and rotation. Each ele- 
ment X G X is represented as a sequence of features (e.g., image pixels) in 
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M". Let 3^ be the discrete space^ of all possible outcomes. In handwrit- 
ing recognition, this space could be the set of possible outcomes, i.e., digits 
'0'-'9' and letters 'a'-'z'. The goal of a supervised learning algorithm is to 
learn a hypothesis (function) / that maps all elements of the input space 
to all possible outcomes, i.e., f : X ^ y. Typically, we fix the hypoth- 
esis space (decision trees, neural networks, SVMs), parameterise it, and 
learn or estimate these parameters from a given set of training examples 
{xi,yi), . . . , {xmiVm) S <V X 3^, which are all drawn independently from an 
identical distribution (i.i.d.) P over X x y. The parameters are estimated 
by minimising a pre-defined loss function i : y x y ^ M. — such as the 
— 1 loss or the squared loss — on the training examples with the hope of 
obtaining a small loss when predicting the outputs of unseen instances. Of- 
ten, the hypothesis space is further restricted using a mechanism known as 
regularisation so that the learned hypothesis performs well (w.r.t. the loss 
function) not only on training but also on unseen examples. This property 
of a learning algorithm is called generalisation. 

In structured prediction, the output space is complex in the following 
sense: (i) there are dependencies between and internal structure among the 
outputs, and (ii) the size of the output space is exponential in the problem's 
input. As a simple example, consider multi-label classification where the 
goal is to predict, for a given input example, a subset of labels from among 
a set of pre-defined labels of size d. Clearly, the size of the output space y 
is 2*^. A reduction from multi-label to multi-class prediction may not yield 
good results as it does not take the cor r elatio ns between labels into account 
(|McCalluml . Il999l : ISchanire and Singerl . 120001 ) . 



Applications 

Natural language processing (NLP) has always been a driving force behind 
research in structured prediction. Some of the early algorithm ic develop- 



ments in (discriminative) structured prediction (jCollinsl . |2002| ) were moti- 
vated by NLP applications. Part-of-speech tagging is a classical example 
where the goal is to mark (tag) the words in a sentence with their corre- 
sponding parts-of-speech. An attempt to perform this tagging operation 
by treating the words independently would discard important contextual 
information. Linguistic parsing is the process of inferring the grammatical 
structure and syntax of a sentence. The output of this process is a parse 
tree, i.e, given a sentence of words, the goal is to output its most likely parse 
tree (see Figure [L3]l . Machine translation, which is the problem of translat- 
ing text in one natural language into another, is another appli cation where 



structured predition algorithms have been successfully applied (jLiang et al.l . 



•^If the space is continuous, then it is a regression problem. We are only concerned with 
discrete output spaces. 
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Figure 1.3: Illustration of parsing. The input is a sentence and the output 
is its parse tree. 



2006). The reader is referred to the works of Taskar ( 20041 ) and Daume III 
(|2006l ) for more details on structured prediction applications in NLP. 

Another group of applications is based on graph matching. A matching 
in a graph is a set of edges without common vertices. Finding a matching 
that contains the largest possible number of edges in bipartite graphs, also 
known as maximum bipartite matching, is a fundamental problem in com- 
binatorial optimisation with applications in computer vision. For example, 
finding a correspondence between two images is a graph ma tching problern 



and was recently cast as a structured prediction problem (jCaetano et al 
20091 ). Se gmenting three-dimensional images obtained from robot range 



scanners into object categories is an important task for scene un derstanding, 



and w as recently solved using a structured prediction algorithm (jAnguelov et al, 



20051 ). Imitation learning is a learning paradigm where the learner tries to 
mimic the behaviour of an expert. In robotics, this type of learning is useful 
in planning an d structured predic tion has been successfully used to solve 
such prqbleni s ( Ratliff et al. . 20061 ) . The reader is referred to the works of 



Taskaii (j2004l ) and iRatlifa (|2009l ) for more details on structured prediction 
applications in robotics. 

Further applicati ons in bioinfo r matic s an d computational biology can b e 
found in the works of Sonnenburg ( 20081 ) and Frasconi and Passerini (I2OO8I ). 



1.2 Why Predict Combinatorial Structures? 

We have already seen that some of the applications described in the previous 
section involve predicting combinatorial structures such as permutations in 
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maximum bipartite matching and trees in linguistic parsing. Furthermore, 
several existing, well-studied machine learning problems can be formulated 
as predicting combinatorial structures. 



Mult i- lab el classification 



Schapire and Singeij . Il999l . 12000 : lElisseeff and Westonl . 



200ll : iFiirnkranz et al.l . l2008l ) is a generalisation of multi-class prediction 
where the goal is to predict a set of labels that are relevant for a given in- 
put. The combinatorial structure corresponding to this problem is the set 
of vertices of a hypercube. 



Multi-category hierarchical classification ( Cesa-Bianchi et al. . 20061 : 



Rousu et al.ri2006l l is the problem of classifying data in a given taxonomy 
when predictions associated with multiple and/or partial paths are allowed. 
A typical application is taxonomical document classification where docu- 
ment categories form a taxonomy. The combinatorial structure correspond- 
ing to this problem is the set of subtrees of a directed, rooted tree. 



Label ranking (jPekel et al.l . 120031 ) is an example of a complex prediction 
problem where the goal is to not only predict labels from among a finite 
set of pre-defined labels, but also to rank them according to the nature of 
the input. A motivating application is again document categorisation where 
categories are topics (e.g., sports, entertainment, politics) within a docu- 
ment collection (e.g., news articles). It is very likely that a document may 
belong to multiple topics, and the goal of the learning algorithm is to or- 
der (rank) the relevant topics above the irrelevant ones for the document in 
question. The combinatorial structure corresponding to this problem is the 
set of permutations. 



Real-vi^orld Applications 

In this thesis, we focus particularly on the problem of predicting combina- 
torial structures such as cycles, partially ordered sets, permutations, and 
other graph classes. There are several applications where predicting such 
structures is important. Consider route prediction for hybrid vehicles — 
the more precise the prediction of routes, the better the optimisation of the 
charge /dis charge schedule, resulti r ig in significant reduction of energy con- 
sumption ( Froehlich and Krumm , 20081 ). Route prediction corresponds to 



prediction of cycles in a street network. The input space X would be a set 
of properties of people, situations, etc.; the output space y would be the set 
of cycles over the places of interest; and yi are the known tours of people 
Xi. Route prediction could also find interesting applications in the design of 
intelligent personal digital assistants that are smart enough to recommend 
alternative routes or additional places to visit. 

As another application, consider de novo construction of (personalised) 
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drugs from the huge spa ce of synthesisable drugs (e.g., a fragment space) 



ng 

(jMauser and Stahi l2007l ) — better predictions lead to more efficient entry 



of new drugs into the market. The task here is to predict graphs (molecules) 
on a fixed set of vertices. State-of-the-art software systems to support drug 
design are virtual screening methods predicting the properties of database 
compounds. The set of molecules that can be synthesised is, however, orders 
of magnitude larger than what can be processed by these methods. In this 
application, X would be some set of properties; y would be a particular set 
of graphs over, say, functional groups; and yi are the compounds known to 
have properties Xj. 



1.3 Goals and Contributions 



The main goal of this thesis is to design and analyse machine learning al- 
gorithms for predicting com binator i al structures. This problem is not new 



Taskar et al.l . 120031 : iTaskad . l2004l : 



2003 ) to predict structures such 



and there exi s ts algorithms (ICollinsl. l2002l : 
Taskar et all . l2005l : iTsochantaridis et al.l . 
as matchings, trees, and graph partitions. Therefore, as a starting point, we 
investigate the applicability of these algorithms for predicting combinato- 
rial structures that are of interest to us. It turns out that the assumptions 
made by these algorithms to ensure efficient learning do not hold for the 
structures and applications we have in mind. We elucidate the limitations 
of existing structured prediction algorithms by presenting a complexity the- 
oretic analysis of them. We then introduce two novel assumptions based 
on counting and sampling combinatorial structures, and show that these as- 
sumptions hold for several combinatorial structures and complex prediction 
problems in machine learning. The consequences of introducing these two 
assumptions occupy a major portion of this work and are briefly described 
below. 



A New Algorithm for Structured Prediction 

We present an algorithm that can be trained by solving an unconstrained, 
polynomially-sized quadratic program. The resulting algorithmic framework 
is a generalisation of the classical regularised least squares regression, also 
known as ridge regression, for structured prediction. The framework can 
be instantiated to solve several machine learning problems, including multi- 
label classification, ordinal regression, hierarchical classification, and label 
ranking. We then design approximation algorithms for predicting combina- 
torial structures. We also present empirical results on multi-label classifi- 
cation, hierarchical classification and prediction of directed cycles. Finally, 
we address the scalability issues of our algorithm using online optimisation 
techniques such as stochastic gradient descent. 



8 



Chapter 1. Introduction 



Analysis of Probabilistic Models for Structured Prediction 



Maximum a posteriori estimation with exponential family models is a cor- 
nerstone technique used in the design of discriminative probabilistic clas- 
sifiers. One of the main difficulties in using this technique for structured 
prediction is the computation of the partition function. The difficulty again 
arises from the exponential size of the output space. We design an algorithm 
for approximating the partition function and the gradient of the log parti- 
tion function with provable guarantees u s ing classical results from Markov 



chain Monte C arlo theory ( Jerrum et al. , 1986 : Jerrum and Sinclair . 19961 : 



Randali 2003 ). We also design a Markov chain that can be used to sample 
combinatorial structures from exponential family distributions, and perform 
a non-asymptotic analysis of its mixing time. These results can be applied 
to solve several learning problems, including but not limited to multi-label 
classification, label ranking, and multi-category hierarchical classification. 



1.4 Thesis Outline 

The thesis is organised as follows: 

Chapter 2 serves as an introduction to structured prediction. We begin with 
a description of generative and discriminative learning paradigms fol- 
lowed by a detailed exposition of several machine learning algorithms 
that exist in the literature for structured prediction. 

Chapter 3 is a survey, including original contributions, on algorithms for 
predicting a specific combinatorial structure — permutations. 

Chapter 4 analyses existing discriminative structured prediction algorithms 
through the lens of computational complexity theory. We study the 
assumptions made by existing algorithms and identify their shortcom- 
ings in the context of predicting combinatorial structures. The study 
will consequently motivate the need to introduce two new assumptions 
— the counting and the sampling assumption. We provide several ex- 
amples of combinatorial structures where these assumptions hold and 
also describe their applications in machine learning. 

Chapter 5 proposes a new learning algorithm for predicting combinatorial 
structures using the counting assumption. The algorithm is a gener- 
alisation of the classical ridge regression for structured prediction. 

Chapter 6 analyses probabilistic discriminative models for structured pre- 
diction using the sampling assumption and some classical results from 
Markov chain Monte Carlo theory. 



Chapter 7 summarises the contributions of this thesis and points to direc- 
tions for future research. 



1.5. Bibliographical Notes 
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Chapter 2 

Structured Prediction 



In supervised learning, the goal is to approximate an unknown target func- 
tion f : X ^ y or to estimate the conditional distribution p{y \ x). There 
are essentially two ways to define a learning model and estimate its pa- 
rameters. In what is known as generative learning, a model of the joint 
distribution p{x,y) of inputs and outputs is learned, and then Bayes rule is 
used to predict outputs by computing the mode of 



P{x,y) 
E V{x,z) 



The above is also known as Bayes classifier. In discriminative learning, the 
goal is to directly est imate the par ameters of the conditional distribution 
p{y I x). According to Vapnik ( 19951 ). one should always solve a problem di- 
rectly instead of solving a more general problem as an intermediate step, and 
herein lies the common justification to model the conditional instead of the 
joint distribution. This has led to an upsurge of interest in the design of dis- 
criminative learning algorithms for str uctured predic t ion. P rominent exam- 
ples incl ude conditiona l random fields ([Laffertv et al.l . 120011). structured per- 
ceptr on (jCollinsl . liooi l. max-margi n Markov networks (iTaskar et al.l . |2003| . 
2OO5I ). and support vector machines ( Tsochantaridis et al. . 2005 ). It has now 
become folk wisdom to attack a learning problem using discriminative as op- 
posed to generativ e methods. An interest ing theoretical and empirical study 
was performed by iNg and Jordan! (j200lh who showed that while discrimi- 
native methods have lower asymptotic error, generative methods approach 
their higher asymptotic error much more faster. This means that generative 
classifiers outperform their discriminative counterparts when labeled data is 
scarce. 

A plethora of algorithms have been propose d in recent years f or pre- 
dicting structured data. The reader is referred to Bakir et al. ( 200?! ) for an 
overview. In this chapter, we discuss several of these algorithms, in par- 
ticular those that are relevant to and motivated the contributions of this 
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thesis. The main goal of a learning algorithm is to minimise an appropri- 
ately chosen risk (loss) function depending on the problem or application. 
We describe several loss functions for binary classification and their exten- 
sions to structured prediction. The algorithmic contributions of this thesis 
fall under the category of discriminative learning. We therefore review clas- 
sical approaches to discriminatively learning a classifier using perceptron, 
logistic regression and support vector machine, and show that many of the 
recently proposed structured prediction algorithms are natural extensions of 
them. 



2.1 Loss Functions 

Given an input space X, an output space y, a probability distribution P 
over X X y, a loss function £{■,■) maps pairs of outputs to a quantity that 
is a measure of "discrepancy" between these pairs, i.e., £ : y x y ^ M.. The 
goal of a machine learning algorithm is to minimise the true risk 

7^true(/)= J t{f{x),y)dP{x,y) . 
xxy 

Since we do not have any knowledge of the distribution P, it is not possible to 
minimise this risk. But given a set of training examples {(x^, drawn 
independently from Xxy according to P, we can minimise an approximation 
to the true risk known as the empirical risk 

m 

7^emp(/)=^^(/(x.),y.) . 
i=l 

In structured prediction, a joint scoring Junction on input-output pairs 
is considered, i.e., f : X xy (with an overload of notation), where the 
score is a measure of "affinity" between inputs and outputs. AnalogoTisly, 
a joint feature representation of inputs and outputs (j) : X x y ^ is 
considered. A linear scoring function parameterised by a weight vector w G 

is defined as 

/(x,y) = {w,(f){x,y)) . 

The goal of a structured prediction algorithm is to learn the parameters w 
by minimising an appropriate structured loss function. Given a test example 
X & X, the output is predicted as 

y = argmax /(x, z) = argmax {w, (f){x, z)) . (2-1) 

One of the major challenges in designing a structured prediction algorithm 
is to solve the above "argmax problem" . The difficulty arises in non-trivial 



2.1. Loss Functions 



13 



structured prediction applications due to the exponential size of the output 
space. 

In the following, we describe commonly used loss functions in classifica- 
tion and regression and extend them to structured prediction. 

Zero-One Loss 

The zero-one loss is defined as 



if y = z 

1 otherwise 



It is non-convex, non-differentiable, and optimising it is a hard problem 
in general. Therefore, it is typical to consider approximations (surrogate 
losses) to it, for instance, by upper-bounding it with a convex loss such as 
the hinge loss^. 

Squared Loss 

The squared loss is commonly used in regression problems with 3^ C R and 
is defined as 

4quare(y,^;) = {V - zf . 

The extension of this loss function for structured prediction is non-trivial due 
to inter-dependencies among the multiple output variables. However, these 
depende ncies can be renioved b y performing (kernel) principal component 
analysis ( Scholkopf et al. . 19981 ) on the output space and by subsequently 



learning separate regression models on each of the independent outputs. The 
final output can be predicted by solving a pre-image problem that maps the 
output in the transformed space back to the original output space. Thus, a 
structured prediciton problem can be re duced to regression. This technique 



is called kernel dependency estimation (j Weston et al.l . l2002l ) . 



Hinge Loss 

The hi nge loss became pop u lar w ith the introduction of support vector ma- 
chines ( Cortes and Vapnik . 19951 ). For binary classification, it is defined 



as 

4inge(y, z) = max(0, 1 - yz) 



where y G { — 1, +1} and z G M is the outp ut of the classifier. The generalisa- 

tion of hinge loss fo; structured predictio; ^Taskar et alL M :lTsochantaridis et al.l 



20051 ) is defined with respect to a hypothesis / and a training example (x, y). 



^The hinge loss is non-difTerentiable, but learning algorithms like support vector ma- 
chines introduce slack variables to mitigate this problem. Support vector machines will 
be described in the next section. 
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Let A:3^x3^— T-Mbea discrete (possibly non-convex) loss function, such 
as the Hamming loss or the zero-one loss, defined on the output space. Con- 
sider the loss function 

^max(/> (a^iy)) = A(argmax/(3;,z),y) . 

zey 

To ensure convexity, the above loss function is upper-bounded by the hinge 
loss as 

Cge(/> y)) = max [A(z, y) + f{x, z) - /(x, y)] . (2.2) 



Logistic Loss 



The logistic loss is used in probabilistic models and is a measure of the neg- 
ative conditional log-likelihood, i.e., — lnp(y | x). For binary classification 
(cf. logistic regression) it is defined as follows: 

Aog(y, z) = ln(l + exp(-yz)) . 

For structured prediction, it is defined (again w.r.t. to a hypothesis / and 
a training example {x,y)) as 



(\ogif, {x,y)) = In 



^exp{f{x,z)) 



f{x,y) 



Exponential Loss 

The exponential loss for binary classification is defined as 

4xp(y,2;) = exp(-yz) . 

As shown in Figure 12.11 the exponential loss imposes a heavier penalty on 
incorrect predictions than the logistic loss. However, this also means that 
the exponential loss i s sensitive t o labe l noise. The exponential loss for 
structured prediction ( Altun et al.l . I2OO3I ) is defined 



as 



4xp(/, {x, y)) = ^ exp [f{x, z) - f{x, y)] . 
zey 



We will revisit this loss in Chapter [5j 



2.2 Algorithms 

Perceptron 



The perceptron (Rosenblatt, 19581 ). briefly described in the introductory 
chapter, is a simple online learning algorithm. It learns a linear classifier 
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Figure 2.1: Various loss functions for binary classification. 



parameterised by a weight vector w and makes predictions according to 
y = f{x) = sgn{{'w, x)). The algorithm operates in rounds (iterations). In 
any given round, the learner makes a prediction y for the current instance x 
using the current weight vector. If the prediction differs from the true label y 
(revealed to the algorithm after it has made the prediction) , then the weights 
are updated according to w w + yx. The weights remain unchanged if 
the learner predicts correctly. If the data are linear l y separable, then the 
perceptron makes a finite number of mistakes (Block, 19621 : Novikofi . 1962 : 
Minskv and Papert . 19691 ) and therefore if the algorithm is presented with 
the training examples iteratively, it will eventually converge to the true 
solution, which is the weight vector that classifies all the training examples 
correctly. 

Theorem 2.1 {Block . 196i : Novikofi . 196i) Let (xi, yi), . . . , (xm, ym) be 
a sequence of training examples with \\xi\\ < R for all i G \m1. Suppose 
there exists a unit norm vector u such that yi{{u,Xi)) > 7 for all the exam- 
ples. Then the number of mistakes made by the perceptron algorithm on this 
sequence is at most {R/^)"^ . 

If the date is not separ able, then we have the following result due to 
Freund and Schapire ( 19991 ). 



Theorem 2.2 { Freund and Schapire . 199^) Let (xi, yi), . . . , (xm, ?/m) a 
sequence of training examples with ||xj|| < R for all i G {mj. Let u be any 
unit norm weight vector and let 'j > 0. Define the deviation of each example 

as di = max(0, 7 — yi{{u, Xj))) and define D = yYl'il 



1 "i ■ 



Then the number 
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of mistakes made by the perceptron algorithm on this sequence of examples 
is at most 

{R + Df 

The perc eptron can be extended to le arn non-linear functions using the 
kernel trick (IScholkopf and Smolal. [2Q03). The resulting algorithm called 



kernel perceptron (jPreund and Schapird . 19991 ) learns functions of the form 



/(•) = ^Cik{xi,-) 



i=l 



where k : X x X ^ M. is a. reproducing kernel function^ on the inputs 
and {cj}jg[m] s-i^s the kernel expansion coefficients. For every reproducing 
kernel, there exists a function (p : X ^ % (the high-dimensional, possibly 
infinite, feature space) such that k{x,x') = (0(x), 0(x')). Thus any learning 
algorithm whose function can be represented as a linear combination of inner 
products can use the kernel trick to avoid explicit computations of these 
inner products in the high-dimensional feature space. The kernel perceptron 
starts by setting all coefficients Cj to 0. It then operates in rounds, similar 
to the perceptron, by repeatedly cycling through the data. If the algorithm 
makes a mistake on a particular instance, then the corresponding coefficient 
is updated according to Cj q -|- y^. The algorithm stops when it classifies 
all training instances correctly. The co nvergence resu lts of the perceptron 
can be extended to the kernelised case (jCartneil . liooil '). 



Theorem 2.3 ICdrtne^ . \200d ) Let {xi ,yi), {x2, 112)1 ■■■ T ixm,ym) be a se- 



quence of training examples. Let k : X x X ^ M be a kernel function such 
that k{xi,Xi) < R for all i £ |m]. Let /*(•) = '^]LiCjk{xj, ■) be a func- 
tion that classifies all the training instances correctly. Suppose there exists 
a margin 7 such that yi X^JLi Cjk{xj,Xi) > 7 for all i £ |m] . Then the num- 
ber of mistakes made by the kernel perceptron algorithm on this sequence of 
examples is at most 



^||/*(.M,2 



7^ 



The percep tron algorithm can be extended to predict structured data 
(jCollineJ . I2OO2I ). Consider linear scoring functions on input-output pairs 
f{x,y) = {w,(l){x,y))^. In every iteration, the output structure of an in- 
stance x is determined by solving the argmax problem, y = argmax^g-y f{x, z), 



reproducing kernel fc is a function with the following two properties: (i) for every 
X £ X, the function k{x, •) is an element of a Hilbert space H, (ii) for every x £ X and 
every function /(■) G H, the reproducing property, {k{x, ■), /(•)) ~ f{^), holds. 

■^Note the use of joint feature representaion of inputs and outputs (with a slight abuse 
of notation). 
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using the current weight vector w. If this output is different from the true 
output, i.e., if y ^ y, then the weight vector is updated as follows: 

w w + y) — cj){x, y) . 



The algorithm stops when all the training instances have been predicted 
with their correct output structures. Similar to the perceptron, convergence 
results can be es tablished for structured prediction in the separable and 
inseparable cases (jCollineJ .! I2OO2I ). Let GEN(x) be a function which generates 
a set of candidate output structures for input x and let GEN(x) = GEN(x) — 
{y} for a training example (x, y). A training sequence (xi, yi), . . . , {xm, ym) 
is said to be separable with a margin 7 if there exists a unit norm vector v 
such that for all i G [m] and for all z G GEN(j;j), the following condition 
holds: (f , (j){xi,yi)) - {v, 4>{xi, z)) > 7. 



Theorem 2.4 j(Collin^ . 200i ) Let (xi, yi), (x2, 2/2), • • • , (a^m, ym) a se- 
quence of training examples which is separable with margin 7. Let R denote 



a constant that satisfies 



{Xi,Z 



< R, for all i G 



m 



and 



for all z G GEN{xi). Then the number of mistakes made by the perceptron 
algorithm on this sequence of examples is at most i?^/7^. 

For the inseparable case, we need a few more definitions. For an (xj, yi) pair, 

z&GEN{xi) {v, 4>{xi, z)) and et = max{0, 7-772^} 

Then the number of mistakes made by the 



define = (f , (/>(xj, yj))— max 

and define D^^^ = 
structured perceptron was shown bv lCollinsI (j2002l ) to be at most 



mm 

D,7 



{R + 



Logistic Regression 

Logistic regression is a probabilistic binary classifier. The probability dis- 
tribution of class label y G {—1, +1} for an input x is modeled using expo- 
nential families 

p(y I :,,^) = exp(y(0(x),t.)) 

exp(y ((/>(x), -w)) + exp(-y ((/)(x), -u;)) 

Given a set of training examples X = (xi, • • • , Xm) G ^ = (yi; ■ ' ' > Um) G y 
the parameters w can be estimated using the maximum (log) likelihood prin- 
ciple by solving the following optimisation problem: 

w = argmax [lnp(y | X, w)\ 

w 

= argmax 

w 



—y^vivi I Xi,w) 
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Observe that the loss function minimised by this model is the logistic loss. 
Often, a Bayesian approach is taken to estimate the distribution of param- 
eters 

( lYY-)- P(^^^ I ^) _ I X,w)p{w) 

^^""l"^'^^ p{X) p{X) 

and the mode of this distribution is used as a point estimate of the parameter 
vector. By imposing a Gaussian prior on w, p{w) = exp(— A||w|p), which 
acts as a regulariser with A being the regularisation parameter, a point 
estimate can be computed by maximising the joint likelihood in w and Y: 

w = argmax \lnp{w, Y \ X)] 



argmax 



^ m 

EPiVi I Xi,w) - X\\w\\ 



m 

i=l 



This technique of parameter estimation is known as maximum a posterior 
(MAP) estimation. The optimisation problem is convex in the parameters w 
and differentiable, and therefore gradient descent techniques can be applied 
to find the global optimum solution. 

Logistic regression can be extended for structured prediction with the 
class conditional distribution 

exp{{(t){x,y),w)) 



p{y I x,w) 



E exp(((/)(j;,z),ii;)) 



The denominator of the above expresion is known as the partition function 
Z{w I x) = E^g-y exp(((/)(a;, z), tt;)). Computation of this function is usually 
intractable for non-trivial structured output spaces, and depends very much 
on the features (j){x,y) and the structure of y. 

We now look at a particular structure — sequences — that motivated 
the development of one of the popular conditional probabili stic models for 



struc tured prediction — conditional random fields (CRF) (jLaffertv et al 



200lh . CRFs offer a viable alternative to HMMs for segmenting and labelel- 



ing sequences. Whereas HMMs model the joint distribution of input and 
outputs j/), CRFs mode l the conditional p{y \ x). A CRF is defined as 
follows dLaffertv et al.l . l200lh : 



Definition 2.1 Let X be a random variable over data sequences, of finite 
length I, to be labeled. Let Y be a random variable over corresponding label 
sequences, where the components Yi,i G |/] can take values from a finite 
alphabet S. Let G = (y,E) be a graph such that the vertex set V indexes 
Y, i.e., Y = (Yp)p^v Then (X,Y) is a conditional random field if, when 
conditioned on X , the random variables Yp satisfy the Markov property 

p{Yp\X,Yg,qy^p)=p{Yp\X,Y„q^p) , 

wher q ^ p means that q is a neighbour of p in G. 
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Figure 2.2: Graphical models of HMMs (left) and CRFs. An open circle 



indic ates that the variable is not generated by the model (jLaffertv et al 

2nnih . 



In the case of sequences, G is a simple chain (see Figure [22]) • The advantage 
of CRFs over HMMs is that the probability of transition between labels can 
depend on past and future observations (if available) and not only on the 
current observation. The partition function for sequences is computation- 
ally tractable using dynamic pr ogramming techniques similar to the forward - 
backward algorithm of HMMs ( Laffertv et al. . 2001 : Sha and Pereira . 20031 ). 
Furthermore, CRFs are guaranteed to converge to the optimal solution due 
to the optimisation problem being convex, whereas HMMs can only guaran- 
tee a locally optimum solution using expectation-maximisation for parame- 
ter estimation. 



Support Vector Machines 

A learning algorithm for binary classification that has attracted considerable 



interest during the past decade is the support vector machine (jBoser et al 



19921 : ICortes and Vaonikl . Il995l l . An SVM learns a hyperplane that separates 
positive and negative ex amples with a. large margin thereby exhibiting good 
generalisation abilities (jVapnikl . Il995l l. It learns a linear function /(x) = 
{w,(j){x)) and minimises the hinge loss by optimising 

^ m 

minAllwlPH > max{0, 1 — (/)(xj))} . 



i=l 



The above optimisation problem can be rewritten as 

m 

min Allu'lp + ^ E 
i=i 

s.t. : yi{w,(t){xi)) >l - ^{m} 



(2.3) 



where ^ S M"^ are the slack variables which correspond to the degree of mis- 
classification of the training instances. Non-linear functions in the original 
feature space can be learned by mapping the features to a high-dimensional 
space and by using the kernel trick. It is computationally more convenient to 



20 



Chapter 2. Structured Prediction 



optimise the Lagrangian dual rather than the primal optimisation problem 
()2.3p due to the presence of box constraints as shown below: 



min ^c^YKYc - l^c 
s.t.: 0<Q<^,ViGM 



(2.4) 



where K is the kernel matrix with entries Kij = k{xi,Xj) and y is a diagonal 
matrix with entries Ya = yi- The non-linear predi ctor can be expressed in 
the following form due to the representer theorem ( Scholkopf et al.l . I2OO1I ): 



/(•) = "^Cikixi,- 



i=l 



SVMs can be extended for structured prediction by minimisin g the s truc- 
tured hinge loss (12. 2D which was first propose d by Taskar et al. ( 20031 ). In 
structured SVMs (jTsochantaridis et al.l . l2005l ). the following optimisation 
problem is considered: 



min M\wf + ^Ylii 

s.t.: {w,(j){xi,yi)) - {w,4){xi,z)) >l 



A(2/i,2) 



(2.5) 

Observe that the slack var i ables have been rescaled. In another formulation, 
proposed by iTaskar et al.l (j2003l ) , the margin is rescaled and the resulting 
optimisation problem is 



min X\\wf + ^ E 



i=l 



S.t. : {w,(j){xi,yi)) - {w,(j){xi,z)) > A{yi,z) -ii,\/z £ y\yi,\/i G |m| 
> 0,Vi e H . 

(2.6) 

While there are exponential number of constraints in the a bove optimisation 



prob lems, it is possible to employ the cutting plane method (jTsochantaridis et al 



200,51 1 by designing an algorithm that returns the most violated constraint 
in polynomial time. The most violated constraint w.r.t. a training example 
{x, y) can be computed by solving the following loss- augmented inference 
(argmax) problems in the slack re-scaling ()2.5p and the margin re-scaling 
(|2.6|) settings respectively: 



and 



y = argmax[l - {w, </)(x, y) - (j){x, z))]A{z, y) , 



y = argmax[A(z, y) - {w, (j){x, y) - 4>{x, z))] . 



(2.7) 
(2.8) 
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Tsochantaridis et al. I (|2005l l showed that a polynomial number of constraints 



suffices to solve the optimisation problem ()2.5p accurately to a desired preci- 
sion e assuming that the algorithm that returns the most violated constraint 
runs in polynomial time. Note that even if the argmax problem is tractable, 
solving the loss-augmented argmax problem requires further assumptions on 
the loss function such as it being decomposable over the output variables. 
An example of such a loss function on the outputs is the Hamming loss. 

The optimisation problem ()2.6p can be rewritten using a single max 
constraint for each training example instead of the exponentially many in 
the following way: 

m 

min X\\w\\^ + 

w,^ i=l 

S.t.: {w,(j){xi,yi)) >max.[A{z,y) + {w,(p{x,z))] - £ {mj (2-9) 
zey 

it > 0,Vi G [m] . 

The above formulation, known as min- max fo r mulat ion, of the optimisation 
problem (|2.6p was proposed by who showed that if it 



is possible to reformulate the loss-augmented inference problem as a convex 
optimisation problem in a concise way, i.e., with a polynomial number of 
variables and constraints, then this would result in a joint and concise con- 
vex optimisation problem for the original problem (j2.9p . In cases where it is 
not possible t o exp ress the inference problem as a concise convex program, 
Taskar et al] (|2005l ') showed that it suffices to find a concise certificate of 



optimality that guarantees that y = ar:gmayiz^y[A{z,y) + {w, (p{x, z))]. In- 
tuitively, verifying that a given output is optimal can be easier than finding 
one. 

Structured SVMs can be kernelised by defining a joint kernel function 
on inputs and outputs k[{x,y), {x' ,y')] and by considering the Lagrangian 
dual of the optimisation problem ()2.6p : 

min Yl ai:,aj:,,k[{xi,z),{xj,z')]-J2^(.yi^z)f^iz 

s.t. : E < A, yi G M (2.10) 

zey 

atz>0, ViGlm], Vz G 3^ . 
The non-linear scoring function can be expressed as 

/(•,•) = aizk[{xi,z), {■,■)] ■ 

ielmi,zey 

using the representer theorem ( Scholkopf et al. . 200 ll ). 



2.3 Summary 

The main purpose of this chapter was to review classical discriminative 
machine learning algorithms, including perceptron, support vector machine 
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and logistic regression, and describe how they can be extended to pre- 
dict structured data. These extensions resulted in several recently pro- 
posed structur e d pre diction algorithms such as condit ional random fields 



(iLaffertv et all . boOlh . max-r nargin Markov networks ([Xaskar et al.l . I2OO3I . 



200a), and structured SVMs (|Tsochantaridis et alJ . 120051 1 . In Chapter d we 



will discuss the assumptions made by these algorithms in order to ensure 
efficient learning, point to their limitations in the context of predicting com- 
binatorial structures, and propose solutions to circumvent these problems. 



Chapter 3 

Predicting Permutations 



Binary classification is a well-studied problem in supervised machine learn- 
ing. Often, in real-world applications such as object recognition, document 
classification etc., we are faced with problems where there is a need to predict 
multiple labels. Label ranking is an example of such a complex prediction 
problem where the goal is to not only predict labels from among a finite 
set of predefined labels, but also to rank them according to the nature of 
the input. A motivating application is document categorisation where cat- 
egories are topics (e.g.: sports, entertainment, politics) within a document 
collection (e.g.: news articles). It is very likely that a document may belong 
to multiple topics, and the goal of the learning algorithm is to order (rank) 
the relevant topics above the irrelevant ones for the document in question. 

Label ranking is also the problem of predicting a specific combinato- 
rial structure — permutations. It is an interesting problem as it subsumes 
several supervised learnin g problems such a s multi-class, multi-label, and 
hierarchical classification ( Dekel et al. . 20031 ). This chapter is a survey of 
label ranking algorithms. 



3.1 Preliminaries 

We begin with some definitions from order theory, and describe distance 
metrics and kernels that will be used in this survey. 

A binary relation ;^ on a (finite) set S is a partial order if >- is asymmetric 
{a y y a) and transitive (aybAbyc^ayc). The pair (S, is 

then called a partially ordered set (or poset). 

We denote the set {{u,v) £ \ u y v} hy p{y) and the set of all 
partial orders over S by V^. Note that every partially ordered set (5^,^) 
defines a directed acyclic graph Gy = (S,p(;^)). This graph is also called 
as preference graph in the label ranking literature. 

A partially ordered set (S, such that Vu, v GT,: uyv\/vyu is 
a totally ordered set and >- is called a total order, a linear order, a strict 
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ranking (or simply ranking), or a permutation. A partial ranking is a total 
order with ties. 

A partial order extends a partial order >- on the same T, ii u y v ^ u v. 
An extension of a partial order ;^ is a linear extension if it is totally 
ordered (i.e., a total order is a linear extension of a partial order >- if 
\/u, V £ Ti, u y V ^ u v). A collection of linear orders realises a partial 
order >~ if Vu, v G T,,u y v <^ {\/i : u v). We denote this set by The 
dual of a partial order >- is the partial order >- with Vu, v G E : u;^?; <^ v y u. 



Distance Metrics 

Spearman's rank correlation coefficient (p) ( Spearman . 19041 ) is a non-parametric 
measure of correlation between two variables. For a pair of rankings vr and 
it' of length /c, it is defined as 

_ 6D(7r,7r') 



k{k^ - 1) 



where D{7r, n') = X^iLi(vr(i) — 7r'(i))^ is the sum of squared rank distances. 
The sum of absolute differences X]f=i K(0 ~ ^'(01 defines the Spearman's 
footrule distance metric. 

Kendall tau correlat^on coeffic^ent (r) is a non-parametric 

statistic used to measure the degree of correspondence between two rankings. 
For a pair of rankings vr and vr', it is defined as 



rid 



\k{k-l) 



where Uc is the number of concordant pairs, and Ud is the number of discor- 
dant pairs in vr and vr'. The number of discordant pairs defines the Kendall 
tau distance metric. 



Kernels 

We now define kernels on partial orders and describe their properties. 



Position kernel: Define 

k^ : VxV ^Rhyk#{y,y') = ^k{\{v G E | ?; ^ u}\,\{v G S | v ^' n}|) , 

where k is a kernel on natural numbers. 

• This function is a kernel and can be computed in time polynomial in 
S. 

• It is injective, in the sense that k^{>~,-) = k^{y',-)<^ >-=>-', for 
linear orders but not for partial orders. 
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Edge Kernel: Define 

kp -.V xV ^Rhy kp{>-, = Ip(^) n • 

• This function is a kernel and can be computed in time polynomial in 

• This kernel is injective in the sense that kp{y, •) = kp{>~' , •) >-=>-'. 

A downside of this kernel is that a ;^ 6 is as similar to 6 ^ a as it is to a ;^ c. 
However, we can overcome this problem easily. Let >- be the dual partial 
order of >-. Define 

hi>-^ >-') = ^p(^' ^0 - >-') ■ 

• This function is a kernel (the feature space has one feature for every 
pair of elements and the value of feature uv is +\/2 ]S. u>~ v, — \/2 iff 
V >- u, and otherwise). 

• It can be computed in time polynomial in 
Extension Kernel: Define 

kr-V xV ^Rhy ke{>~, = \£{>~) n ■ 

• This function is a kernel. 

• It is injective in the sense that ki{y, •) = ki{y', •) >-=>-'. 

• The kernel cannot be computed in polynomial time as counting lin- 
ear extensions (or, equivalently , computing ki{>-,>~)) is #P-complete 
( Brightwell and Winklerl . 1 19921 ). However, it can possibly be approx- 
imated as {i) the number of linear extensions can be approximated 
( Huberl . I2OO6I ) ■ and {ii) the set of linear extensions can be enumerated 
almost uniformly. 

• We have ki{y,y') = 0<^3u,v £ T, : u y v A v u. We cah such 
partial orders contradicting. 

• For non-contradicting partial orders define the partial order y 
U such that Vn, w G S : u{>~ U >~')v vV u>~' v. 



Label Ranking — Problem Definition 

Let X C M" be the input (instance) space, S = {I,-- - ,d} = {dj be a 
set of labels, and y be the output space of all possible partial orders over 
S. Let T = {(2;j, yj)}jg|m] & {X x 3^)™ be a set of training examples. Let 
Gi = (yijEi) denote the preference graph corresponding to yi, for all i £ |m]. 
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The goal of a label ranking algorithm is to learn a mapping f : X ^ y, where 
/ is chosen from a hypothesis class J-, such that a predefined loss function 
i:J-xyxy^M is minimised. In general, the mapping / is required to 
output a total order, but it is also possible to envisage settings where the 
desired output is a partial order. Let A;;f : A" x A" — R and ky : y x y ^ R. 
denote positive definite kernels on X and y, respectively. 



3.2 Learning Reductions 



Learning reductions are an efficient way to solve complex prediction prob- 
lems using simple models like (binary) classifiers as pr imitives. Such tech - 
niques hav e been applied to solve problern s like ranking (iBalcan et al.l . j2008ll. 
regre ssion ( Langford and Zadroznvl . 20051 ). and structured prediction ( Daume III 
20061 ). just to name a few. 

Label ranking c an be reduced to binary classification us ing Kesler's con 



structio n (jNilsson . 19651 ). This approach was proposed by Har-Peled et al 



(200230) under the name of constraint classification. The idea is to con- 
struct an expanded example sequence T' in which every example (x, y) £ 
M" X y with its corresponding preference graph G = {V,E) is embedded in 
^dn ^ {—1^ 1}^ with each preference {p,q) G E contributing a single positive 
and a single negative example. The Kesler mapping P is defined as follows: 



{{x(g) Op, 



9' 



(p, q)eE} <Z M.^" 
-1) I {p,q) eE}C 



x{l} 



-1} 



where Oj is a d-dimensional vector whose jth component is one and the rest 
are zeros. Let P{x, y) = P+{x, y) U P-{x, y). The expanded set of examples 
is then given by 



r = p{T) 



U P{x,y)<^ 



X{-1,1} . 



A binary classifier (linear separating hyperplane) trained on this expanded 
sequence can be viewed as a sorting function over d linear functions, each 
in R". The sorting function is given as argsortjg|^j {wj,x), where Wj is the 
j-th chunk of the weight vector w £ R' ^", i.e., w,- = (w(j -\)n_^^, • • • , Wjn)- 

A reduction technique proposed bv iFiirnkranz known as pairwise 

classification can be used to reduce the problem of multi-class prediction to 
learning binary classifiers. An extension of this technique known as ranking 
by pairwise comparison ( RPC) was proposed in (IFiirnkranz and Hiillermeieii . 
200.4 iHiillermeier et al.l . l2008l ) to solve the label ranking problem. The cen- 
tral idea is to learn a binary classifier for each pair of labels in S resulting 
in d{d — l)/2 models. Every individual model Aipq with p,q £ T, learns a 
mapping that outputs 1 p q and if q P for an example x £ X. 
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Alternatively, one may also learn a model that maps into the unit inter- 
val [0, 1] instead of {0, 1}. The resulting model assigns a valued preference 
relation to every example x G X: 



Rx{p,q) 




if p < q 
Mpq{x) \ip>q 



The final ranking is obtained by using a ranking procedure that basically 
tries to combine the results of these individual models to induce a total 
order on the set of labels. A simple ranking procedure is to assign a score 
Sx{p) = X^pT^g Rx{p-, q) to each label p and obtain a final ordering by sorting 
these scores. This strategy exhibits desirable properties like transitivity of 
pairwise preferences. Furthermore, the RFC algorithm minimises the sum 
of squared rank distances and an approximation of the Kendall tau distance 
metric under the condition that the binary models Aipq provide correct 
probability estimates, i.e., Rx{p,q) = Aipq{x) = Fr[p ol- 



3.3 Boosting Methods 



A boosting (iFreund and Schapird . 1997 ) algorithm for label ranking 



proposed by iDekel et al 



was 



^2()0i ). A label ranking function / : x S ^ M 
is learned such that for any given x € X, a total order is induced on the 
label set hy p Q ■^=^ f{x,p) > f{x,q). The label ranking function is 
represented as a linear combination of a set of L base ranking functions, i.e, 
f{x,p) = Ylf=i ^i^ii^iP)^ where {Ai};g|^] are parameters that are estimated 
by the boosting algorithm. We denote the label ranking induced by / for x 
by f{x) (with a slight abuse of notation). A graph decomposition procedure 
P, which takes a preference graph Gi = {Vi,Ei) for any Xi £ X as its input 
and outputs a set of Si subgraphs {Gi^s}s£iSij^ has to be specified as an input 
to the learning algorithm. A simple example of a graph decomposition proce- 
dure is to consider every edge e G Ei as a subgraph. Other examples include 
decomposing the graph into bipartite directed gr aph G,;^,, = (Uj^s^V j..^, Ej,,^) 
such that \Ui^s\ = 1 or |T^,s| = 1 (see Figure 2 in iDekel et al.l (120031 ) for an 
illustration). The generalised loss due to f{xi) w.r.t. Gi is the fraction of 
subgraphs in V{Gi) with which /(xj) disagrees. The generalised loss over 
all the training instances is defined as 



m ^ Si 



where 5{-,-) is a loss function defined on t he su bgraphs such as the 0-1 loss 
or the ranking loss (ISchapire and Singer . l200d l. While minimising such a 
discrete, non-convex loss function is NF-hard, it is possible to minimise an 
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upper bound given by 

6{f (xi), Gi^s) <log2{l+ ^ exp(/(xi,term(e)) - /(xi,init(e)))) . 

where init(e) (resp. term(e)) is the label corresponding to the initial (resp. 
termin al) vertex of any d irected edge e E To minimise this upper 



bound, iDekel et alJ (120031) proposed to use a boosting-style algorit hm for 



exponential models ( Lebanon and LafFertv . 2001 : Collins et al. . 20021 ) to e& 



timate the model parameters A and also proved a bound on the decrease in 
loss in every iteration of the algorithm. 

3.4 Label Ranking SVM 



Elisseeff and WestonI (I2OOII ) proposed a kernel method for multi-label clas- 



sification. A straightforward generalisation of this approach results in a 
label ranking algorithm. Define a scoring function for label p and input x as 
hp(x) = {wp, x), where Wp is a weight vector corresponding to label p. These 
scoring functions together will define the mapping / b y a sorting operation , 
i.e., f(x) = argsortjg|^j (it;j, x). The ranking loss ( Schapire and Singeil . 



200d ) w.r.t. to a preference graph G = (V,E) is defined as £{f,x,y) 



j^\{p,q) £ E s.t. hp{x) < hq{x)\. The following optimisation problem min- 
imises the ranking loss: 

d m 

mm ^ \\wj\\^ + A E lIj E C^pg 

i^'jiieM i=l 1=1 {p,q)&E, 

subject to : {wp - Wq,Xi) >1 - Cipg, V(p, q) £ Ei,yi £ |m] 

£,ipq > 0,V(p,g) e Ei,\/i e [m| , 

where A > is the regularisation parameter that trades-off the balance of 
the loss term against the regula,riser. 

Shalev-Shwartz and Singer (I2OO6I ) considered the setting where the train- 



ing labels take the form of a feedback vector 7 G M . The interpretation is 
that label p is ranked higher than label g iff 7^ > Jq- The difference 7^ — jq 
encodes the importance of label p over label q and this information is also 
used in the optimisation problem. The loss function considered in this work 
is a generalisation of the hinge-loss for label ranking. For a pair of labels 
{p, q) € E, the loss with respect to / is defined as 

^pMi^)^l) = [(7p -7g) - ihp{x) - hq{x))]+ , 

where [a]+ = max(a, 0). At the heart of the algorithm lies a decomposition 
framework, similar to the one mentioned in the previous section, that de- 
composes any given feedback vector into complete bipartite subgraphs, and 
losses are defined and aggregated over these subgraphs. This decomposition 
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framework makes the approach very general, albeit at the cost of solving 
a complex optimisation problem. Interestingly, the quadratic programming 
form ulation for multi-label class ification as proposed by Elisseeff and We- 
ston (jElisseeff and Westonl . |200l|) can be recovered as a special case of this 
approach. 



3.5 Structured Prediction 



The motivation behind using a structured prediction framework to solve the 
label ranking problem stems from the added flexibility to use arbitrary loss 
functions and kernels, in principle, on the output space. In this section, we 
let y to be the space of all total orders of the label set 



Recall the optimisation problem of structured SVMs (jTsochantaridis et al 

200i) (cf. Chapter E]): 



min A||wf + ^ E ?i 
w,^ i=i 

s.t. : {w,(j){xi,yi)) - {w,(j){xi,z)) > A{yi,z) 
> 0,Vi G [m| . 

(3.1) 

Here, ArJ'xJ'— T-Misa loss function on total orders. To handle the 
exponential number of constraints in the above optimi s atioii problem, we 
need to design a separation oracle ( Tsochantaridis et al. . 20051 ) that returns 
the most violated constraint in polynomial time. The most violated con- 
straint with respect to a training example (x, y) can be computed using the 
following optimisation problem: 

y = argmax /(x, y) + A(z, y) . 

The above loss-augmented inference probl em and the decoding problem can 
be solved using techniques described by Le and Smolal ( 20071 ). Here, the 
scoring function / takes a slightly different form. Let g{x,p; Wp) = {4>{x),Wp) 
{(j) is feature map of inputs) denote the scoring function for an individual 
label p G S parameterised by weight vector Wp. Now define the scoring 
function / for the pair (x, y) as follows: 



/(x,y;w) = ^c/(x,j)c(y)j 



(0(x),U'j)c(7/)j 



parameterised by the set w = {wj}j(zid\ of weight vectors, where c is a 
decreasing sequence of reals and c(y) denotes the permutation of c ac- 
cording to y, i.e., c{y)j = Cy(^j) for all j £ {dj. The final prediction y = 
argmaXy^y f{x,y) is obtained by sorting the scores g{x,p) of the individ- 
ual labels. This is possible due to the Polya-Littlewood-Hardy inequality 
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(|Le and Smolal . l2007l l . The decoding problem is thus solved. We now turn 



our attention to designing a separation oracle. The goal is to find 

y = aTgma^f{x,y) + A{y,z) 

d (3.2) 
= argmax ^ {(t){x),Wj)c{y)j + A(y, z) . 

For cert ain loss functi o ns th at are relevant in information retrieval appli- 
cations, Le and Smola ( 200?! ) showed that the above optimisation problem 



can be formulated as a linear assignment problem and can be solved using 
the Hungarian marriage method (Kuhn-Mungres algorithm) in 0{d?) time. 
For arbitrary loss functions, it may not be feasible to solve the optimisa- 
tion problem (j3.2p efficiently. Note that the term A(-, •) in the separation 
oracle and also in the constraint set of structured SVM specifies an output 
dependent margin. Replacing it with a fixed margin 7 (= 1) would greatly 
simplify the design of separation oracle since it reduces to a sorting operation 
as in the decoding problem. Since the optimisation problem (jS.ip can be 
kernelised in its dual form (cf. problem ()2.10p ). it allows us to use arbitrary 
kernel functions on the output space such as those described in Section 13.11 



3.6 Online Methods 



Onlin e classification and regression algorithms like perceptron (jRosenblattl . 
19581 ) typically learn a linear model f{x) = {w, x) parameterised by a weight 
vector w £ M". The algorithms operate in rounds (iterations). In round t, 
nature provides an instance to the learner; the learner makes a prediction 
using the current weight vector w^; nature reveals the true label y* of x*; 
learner incurs a loss £{(^w^,x^) ,y*) and updates its weight vector accord- 
ingly. Central to any online algorithm is the update rule that is designed 
in such a way so as to minimise the cumulativ e loss over all the iter a tions . 
In label ranking scenarios, online algori thms (jCrammer and Singer! . 
2OO5I : IShalev-Shwartz and Singed . l2007bl ) maintain a set of weight vectors 
{wj}j^^^j, one for every label in S, and the update rule is applied to each 
of these vectors. 

Online algorithms for label rank ing have been analysed using two differ- 
ent frameworks: passive- aggressive ( Crammer et al. . 20061 ) and primal-dual 
(jShalev-Shwartz and Singerl. l2007al). Passive-aggressive algorithms for label 
ranking (jCrammer and Singed . l2005l ) are based on B regman divergences and 
resul t in multiplicative and a dditive update r ules (jKivinen and Warmuthl . 
1993). A Bregman divergence (IBregmanl . is similar to a distance met- 

ric, but does not satisfy the triangle inequality and the symmetry properties. 
In every iteration t, the algorithm updates its weights in such a way that 
it stays close to the previous iteration's weight vector w.r.t. the Bregman 
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divergence, and also minimises the loss on the current input-output 

pair. Let W G M*^^" denote the set of weight vectors in matrix form. The 

following optimisation problem is considered: 

= argmin Bf{W\\W^-^) + X£{f{x^;W),y^) , 
w 

where Bp is the Bregman divergence defined via a strictly convex func- 
tion F. The choice of the Bregman divergence and the loss function re- 
sult in different update rules. Additive and multiplicative update rules can 
be derived respectively by c onsidering the following optimisation problems 
( Crammer and Singer . 20051 ): 



= argminlliy- + A£(/(x*;VF),y*) , 
w 

and 

Ty* = argminL»KL(VF||VF*"^) + A^(/(x*;Ty*),y*) , 
w 

w here Dk^. is the Kullba c k-Lieb ler divergence. The loss function s considered 



byfc rammer and Singed (|2005l ) is similar to the ones defined by iDekel et al.l 



(see also Section 13. 3p , where a preference graph is decomposed into 
subgraphs using a graph decomposition procedure, and a loss function such 
as the 0-1 loss or the ranking loss is defined on every subgraph. The loss 
incurred by a ranker W for a graph decomposition procedure T){G) is given 
as 

l{f{x-W),y)= \{{r,s)eg:{wr.,x)<{ws,x)]^%\. 
The primal-dual framework dShalev-Shwartz and Singer . 2007al ) was used 



by Shalev-Shwartz and Singer ( 2007bl ) resuting in a unifying algorithmic ap- 



proach for online label ranking. The loss function considered in this work 
is a generalisation of the hinge-loss for label ranking. The training labels 
are assumed to be a set of relevant and irrelevant labels (as in multi-label 
classification). For a given instance x & X, let Sj.CS denote the set of 
relevant labels. The hinge-loss for label ranking w.r.t. an example (a;*,S*) 
at iteration t is defined as: 

l\W';{x\y'))= max [7-(K,x*)-(tz;*,x*))]+ . 

The central idea behind the analysis is to cast online learning as an opti- 
misation (minimisation) problem consisting of two terms: the complexity 
of the ranking function and the e mpirical label-ranking los s . Th e notion 
of duality in optimisation theory ( Bovd and Vandenberghe . 20041 ) is used 



to obtain lower bounds on the optimisation problem, which in turn yields 
upper bounds on the num ber of prediction mistakes rnade by the algorithm. 
The reader is referred to (jShalev-Shwartz and Singed . l2007bl l that presents 



several update rules for label ranking, and thes e are also shown to generalis e 
other update rules such as the ones defined bv lCrammer and Singer 



32 



Chapter 3. Predicting Permutations 



3.7 Instance-based Learning 

In instance-based learning, the idea is to predict a label for a given instance 
based on local information, i.e., labels of neighbouring examples. In la- 
bel ranking, these labels are rankings (partial orders, total orders, partial 



rankings) a r id one has to use aggregation algorithms (IDwork et al 



2001 



Fagin et aP . l2004l : lAilon et all , boosi : lAilonl . boO?! : Ivan Zuvlen and Williamsonl . 

2007l l to combine rankings from neighbouring examples. Instance-based 

l earning alg o ritms for label ranking w e re proposed recently bvlBrinker and Hiillermeier 
(|2006l . l2007l ): ICheng and Hiillermeierl (j2008l ): ICheng et al.l toO^ . Let{m}„v:|n] 
denote a set of B neighbouring rankings for any given instance x € X. 
The goal is to compute a ranking y that is optimal w.r.t. a loss function 
i y x y ^ M defined on pairs of rankings. More formally, the following 
optimisation problem needs to be solved: 



B 

y = argmin V]£(y,yi) 



(3.3) 



This is a very general statement of the problem. Various aggregation algo- 
rithms, which we survey in the sequel, can be used to solve this optimisation 
problem depending on the nature of the loss function and also on the inputs 
(of the optimisation problem). 



Aggregating Total Orders 

The problem of finding an optimal ranking when the inputs in the optimi- 
sation problem (|3.3p are total orders can be fomulate d as a feedb a .ck ar c 
set problem in digraphs (specifically in tournaments) ( Ailon et al. . 20081 ) . 
A tournament is a directed graph G = {V, E) such that for each pair of 
vertices p,q ^ V, either {p,q) ^ E or {q,p) G E. The minimum feedback 
arc set (FAS) is the smallest set E' <^ E such that {V, E — E') is acyclic. 
The rank aggregation problem can be seen as special case of weighted FAS- 
tournaments; the weight Wpq of an edge {p, q) is the fraction of rankings that 
rank p before q. 

Optimising the Spearman footrule distance metric in the minimisation 
problem (|3.3p is equivalent to finding the minimum co s t max imum match- 
ing in a bipartite graph with d nodes (jPwork et al.l . I2OOII ). A 2-factor 
approxim ation algorithm w ith time complexity 0(Bd + d\ogd) was pro- 
posed by iFagin et al.l (120041). Optimising the Kendall tau distance metric 



in (1331) is NP-hard (Bartho 



appr oximation algorithms (jAilon et al 



di III et al.l. 



1989) and therefore one has to use 



20081 : Ivan Zuvlen and Williamson 



2OO7I ) to output a Kemeny optimal ranking. There exists a determinis- 
tic, combinatorial 8/5-approximatio n algorithm for aggregating total orders 
(jvan Zuvlen and Williamsonl . |2007| ). The appro ximation rat i o can be im- 
proved to 11/7 by using a randomised algorilhm Uilon et a].I .B and to 
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4/3 b y using a derterministic linear programming based algorithm (Ivan Zuvlen and Williamsonl. 
2007l^ . A polynomial time approximation scheme was proposed bv lKenvon-Mathieu and Schudv 
(|2007l ^. 



Aggregating Partial Rankings 



Aty 



jical application of this setting is multi-label ranking (jBrinker and Hiillermeier 



20071 ) where the preference graph is bipartite with directed edges between 



relevant and irrelevant labels. There exists a deterministic, combinatorial 



8 / 5-a pproximation algorithm for aggregating partial rankings (jvan Zuvlen and Williamsonl . 
2003). The running time of this algorithm is 0{d^). A slightly better approx- 
imation guarantee of 3/2 can b e obtained by using a deter r ninist ic, linear 
programming based algorithm (jvan Zuvlen and Williamsonl . l2007l ). These 
algorithms minimise the Kemeny distance between the desired output and 
the individual partial rankings. An exact method for aggregating partial 
rankings using (generalised) sum of squa red rank distance metric was pro- 
posed bv lBrinker and Hiillermeier 



Aggregating Partial Orders 

In this setting, we allow inupt labels to be partial orders and the desired 
output is a total order. To the best of our knowledge, there are no approx- 
imation algorithms to aggregate partial orders, but it is possible to reduce 
the problem to that of aggregating total orders as follows: given a partial 
order , sample a set (of some fixed cardinality) of linear extensions (jHuber . 
200^) and use existing approximation algorithms for aggregating total or- 
ders. If the desired output is a partial order and not a total order, one can 
consider the following optimisation problem: 



Under the assumption that kx{-,-) ^ and ky{-,-) > 0, and if the edge 
kernel (cf. Section 13. ip on partial orders is used, the above optimisation 
problem ca n be approximately solved us i ng the maximuni acycl ic subgraph 
algorithm ( Hassin and Rubinstein . 1994 : McDonald et al. . 20051 ). 



3.8 Summary 

Label ranking is a specific example of the learning problem of predicting 
combinatorial structures. The problem has attracted a lot of interest in 
recent years as evidenced by the increasing number of algorithms attempting 
to solve it. The main purpose of this chapter was to give an overview of 
existing literature on label ranking algorithms. While most of these are 
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specialised algorithms, we have seen in Section 13.51 that the problem can 
also be solved within the structured prediction framework using structured 
SVMs. We will revisit the problem of predicting permutations — as an 
example — in the next chapter. 



Chapter 4 

Complexity of Learning 



In Chapter [51 we discussed several discriminative structured prediction al- 
gorithms. We will now revisit some of these algorithms, try to get a deeper 
understanding of the assumptions they make to ensure efficient learning, 
and identify their shortcomings in the context of predicting combinatorial 
structures. We will then introduce two new assumptions and show that they 
hold for several combinatorial structures. These assumptions will be used 
in the design and analysis of structured prediction algorithms in subsequent 
chapters. 



4.1 Efficient Learning 

Recall that the structured loss with respect to a hypothesis h and a training 
example (x, y) is defined as ^max(^i 2/)) = A{aTgmsLX^^y h{x, z),y), where 
A:3^x3^— T-Misa discrete, non-convex loss function defined on the output 
space. To ensure convexity, it is typical to upper-bound this loss by the struc- 
tured hinge loss i^^^^^{h, {x,y)) = argmax^^y \A(z, y) + h(x, z) - h(x, y)] . 
Regularised risk minimisation based approaches ( Tsochantaridis et al.l . l2005l : 



Taskar et al.l . l2003, . ,2005. ) aim at solving the optimisation problem Q{{{xi,yi) 



i E |ml}) 



argmin \Q[h] + J2 
hen ieim] 

subject to h{xi,yi) - h{xi,z) > A{z,yi) - Vi G [m], Vz G 3^ \ 
C^ > 0, G |ml , 

(4.1) 

where A > is a regularisation parameter, ^ is a reproduding kernel Hilbert 
space with a corresponding kernel y), (x', y')], and : ^ — )• R is a 
convex regularisation function such as the squared £2 norm, || • |p. 

The major issue in solving this optimisation problem is that the number 
of constraints grows proportional to If the set y is parameterised by 
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a finite alphabet S, then the number of constraints is usuahy exponential 
in I S I . To ensure polynomial time complexity different assumptions need to 
be made, and depending on the nature of different methods are used that 
iteratively optimise and add violated constraints. We now describe these 
assumptions in decreasing order of strength. 



Decoding 

The strongest a ssumption is the existence o f a polynomial time algorithm for 
exact decoding ( Tsochantaridis et al. . 20051 ). Decoding refers to the problem 
of computing argmax^g-y /i(x, z) for a given {h, x) £ Ti x X. 



Separation 

A weaker assumption made by structured perceptron (ICoUinsl . [iooi l is the 
existence of a polynomial time algorithm for separation^ . Separation refers 
to the problem of finding for a given scoring function h, x £ X and y £ y, 
any z G y such that h{x, z) > h{x, y) if one exists or prove that none 
exists otherwise. A polynomial time algorithm for exact decoding implies a 
polynomial time algorithm for separation. 



Optimality 



An even weaker assumption is that optimality is in NP (jTaskar et al.l . |2005| ) . 



Optimality refers to the problem of deciding if for a given scoring function h, 
X G X and y £ y, it holds that y G argmax^^y h{x , z) . A polynomial time 
algorithm for separation implies a polynomial time algorithm for optimality. 

For several combinatorial structures considered in this work, there exists 
a short certificate of non-optimality (i.e., non-optimality is in NP), but there 
is no short certificate of optimality unless coNP=NP (complexity classes are 
illustrated in Figure HT]) . This implies that polynomial time algorithms for 
exact decoding and separation do not e xist. In other words, none of the 
existing structured p r edicti on algorithms dColhnsl . 120021 : iTaskar et all 1200.4 
Tsochantaridis et al. . 2005 : Taskar et al. . 20051 ) can be trained efficiently to 
predict the combinatorial structures that are of interest to us. 

Recently, there have been some attempts to use a pproximate inference 



algori thms for learning structured prediction models. iKulesza and Pereira 
(|2007l ) performed a theoretical analysis of the relationship between approx- 
imate inference and efficient learning. They showed that learning can fail 
even when there exists an approximate inference algorithm with strong ap- 
proximation guarantees and argued that, to ensure efficient learning under 



^Algorithms for decoding and separation are also referred to as inference algorithms in 
the literature. 



4.2. Hardness Results 



37 




Figure 4.1: Complexity class diagram. 



approximate inferece, it is crucial to choose compatible inference and learn- 
ing algorithms. As an example, they showed that a linear programming 
based appro ximate inferen c e algo rithm is compatible with the structured 
perceptron. iMartins et al. provided risk bounds for learning with 

relaxations of integer linear programming based inference that is common 
in natural language applications. Training str uctured SVMs with a,pproxi - 
mat e inference was co nsidered in the works of Finlev and Joachims ( 20081 ) 
and lKlein et al.l (12008 ) with mixed (negative and positive) results. The con- 
clusion of iKlein et al.l (j2008l )'s empirical study was that structured SVMs 
trained with exact inference resulted in improved per formance when com- 
pared to those trained with approximate inference. Finlev and Joachimsl 
(|2008l l considered two classes of approximate inference algorithms — under- 
generating (e.g., greedy methods) and overgenerating (e.g., relaxation meth- 
ods like linear programming and graph cuts) algorithms — and showed that 
models trained with overgenerating methods have theoretical and empiri- 
cal advantages over undergenerating methods. The aforementioned mixed 
results motivated us to consider efficient learning methods for structured 
prediction that tries to avoid using any inference algorithm, be it exact or 
approximate, during training. 



4.2 Hardness Results 

In the following, we will also be interested in the set of hypotheses potentially 
occurring as solutions to the optimisation problem ()4.ip and denote it as 
Hopt = {Q{D) \DQXxy}. 

First, we show that the assumptions described in the previous section do 
not hold for several relevant output sets. In particular, we show that they do 
not hold if the non-optimality decision problem for a given (3^, "Hopt) is NP- 
hard. This decision problem is the complement of the optimality problem 
and is defined as deciding if for a given h E Tiopt, x G X and y £ y, a z G y 
exists such that h{x,z) > h{x,y). Second, we show that for the specific 
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case of undirected cycles (route prediction), non-optimality is indeed NP- 
complete. Our hardness result gains further significance as we can also show 
that this case can indeed occur for a specific set of observations. Third, we 
turn to a class of problems for which the output forms particular set systems, 
show that in this case the assumptions decoding, separation, and optimality 
are not contained in P, and note that decoding is often hard as it corresponds 
to edge deletion problems ( Yannakakisl . Il97i 



Representation in Output Space 



As described in the previous section, state-of-the-art structured output learn- 
ing algorithms assume (at least) that deciding if an output structure with 
higher score than a given one exists is in NP. With the definitions given 
above, we formally define {y,Hopt)— Optimality as: 



5z G 3^ : h{x,z) > h{x,y)) 



Let kx '■ X xX ^"Khe, the kernel function on the input space X. We assume 
that a polynomial time computable map ijj -.y \s defined and will refer 

to the inner product under this map by ky : y,y' ^ {ip{y),ip{y')) . For the 
joint kernel of inputs and outputs, we consider the tensor product kernel 
k[{x,y), {x' ,y')] = k yjx, x')ky(y, y')^ which has found applicat ions in many 
important domains ( Jacob and Vert . 20081 : Erhan et al. . 20061 ). We refer to 
the Hilbert spaces corresponding to k x, ky, k as T-Lx-, "Hy, T-L, respectively. 

The strong representer theorem (jScholkopf et al.l . l200ll ) holds for all 
minimisers h* G ^opt of the optimisation problem ()4.ip . It shows that 
h* £ span{A;[(xi, z), (•, •)] \ i £ {mj, z £ y}. That is. 



h*{x,y)= ^ ai^k[{xi,z),{x,y)] 

ielml,zey 



^ oci^kxixi, x) {il){z),^{y)) 



X] OLi^kx{xi,x)'ilj{z),tlj{y)) . 

\ie[ml,2Gy / 

This motivates us to first consider (3^, ■0)— Optimality, defined as: 
w E span{V'(y) \ y £ y},y £ y ^ {$z £ y : {w, ij{z)) > {w, '4){y))) . 



To show that (3^, V')— Optimality is not in NP, it suffices to show that 
{y,'4))— Non-Optimality is NP-hard. Formally, we define (3^,-0)— NON- 
Optimality as: 

w £ span{?/^(y) \ y £ y},y £ y ^ {'^z £ y : {w, ip{z)) > {w, iljiy))) , 
and correspondingly (3^, "Hopt)— NON-OPTIMALITY. 
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Route Prediction — Hardness of Finding Cyclic Permutations 

We now consider the route prediction problem introduced in Section 11.21 as 
an instance for which (3^,^opt)— NON-OPTIMALITY is NP-complete and for 
which thus the assumptions made by state-of-the-art structured prediction 
algorithms do not hold. 

In the route prediction problem, each Xi represents a sequence of fea- 
tures such as an individual person, a day, and a time of the day; S'^^'^ is a 
set containing points of interest; y^^^ is the set of cyclic permutations of 
subsets of S (we are interested in cyclic permutations as we assume that 
each individual starts and ends each route in the same place, e.g., his/her 
home); and d^^^ = S x S. We represent a cyclic permutation by the set of 
all neighbours. For instance, the sequences {abc, bca, cab, cba, acb, bac} are 
equivalent and we use {{a, b}, {b, c}, {c, a}} to represent them. Furthermore, 
we define ip'^^^^^iu) = 1 if iu,v) G y, i.e., v and u are neighbours in y, and 
otherwise. 

Lemma 4.1 With all constants and functions defined as above, (y^^'^ — 
Non-Optimality is NP-complete. 

Proof The proof is given by a Karp reduction of the Hamiltonian path prob- 
lem. Let G = (y, E) be an arbitrary graph. Wlog, assume that F n [3] = 0. 
Construct a weighted graph G on the vertices of F U |3] with adjacency 
matrix 

w = {\V\-2)- i:^y%{{l, 2}, {2, 3}, {3, 1}}) + ^^y^(e) . 

Now, (J^'^y^ ^'^y'^)- Non-Optimality(u), {{1, 2}, {2, 3}, {3, 1}}) holds iff there 
is a cyclic permutation on y U [3] that has an intersection with G of size 
\V\ — 1. The intersection of any graph with a graph is a set of paths in G. 
As the total number of edges in the intersection is \ V\ — 1, the intersection 
is a path in G. A path in G with — 1 edges is a Hamiltonian path. □ 

Theorem 4.2 With all constants and functions defined as above, {y^^^ ,'H'^t)~ 
Non-Optimality is NP-complete. 

Proof We consider T^op^ for A = and construct training data and a 
function that satisfies all but one constraint if and only if the graph has a 
Hamiltonian cycle. 

For an arbitrary graph G = {V, E) let G, w be defined as above and let 
X = 2^ with kx{e,e') = |en e'|. With m = \E\ + 1, {xi \ i e l\E\j} = E, 
and Xm = E we choose the training data D as 

{({e}, {e}) \e€E}U {{E, {{1, 2}, {2, 3}, {3, 1}})} . 
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For i £ l\E\J, let Qj^ = 1/2 for z = yi and ctj^ = otherwise. For i = m, let 
c^iz = (1^1 ~ 2)/2 for z = yi and cxiz = otherwise. For i E H-El], we then 
have 

h{xi,yi) = i ^ cxjzkx{xj,Xi)ip{z),'4){yi) \ = 1 
XjelmLzGyt^yc / 

and y 7^ Hi ^ h{xi,y) = 0. Thus there are no violated constraints for i G 
and h is indeed optimal on this part of the data. Furthermore, for all 
y e 3^^y^: 

h{xm,y) = ( ^ Q;j^A;;t'(a;j,a;m)V'(^),V'(y) y = V'(y)) • 

Together with Lemma l4. II this gives the desired result. □ 
Connections to other Assumptions 

Recall the assumptions made by existing structured prediction algorithms, 
namely, decoding, separation, and optimality (cf. Section 14. ip . Define 
(yj'H)- Decoding as 

h £ T-l,x £ X ^ argmax/i(x, z) , 



{y^T-L)- Separation as 

h £n,x e x,y ey 



z for some z £ y with h{x, z) > h{x, y) 
otherwise 



and {y^T-L)— Optimality as 

h£'H,x £ X,y £yh^$z £y : h{x, z) > h{x, y) . 

Proposition 4.3 

{y^T-L)— Non-Optimality is NP-complete. 
=^ {y^'H)- Optimality is coNP-complete. 
=^ {y,'H)- Separation is coNP-hard. 
=^ {y,'H)- Decoding is coNP-hard. 

Proof The result follows immediately by observing that (1) optimality is 
the complement of non-optimality, (2) any separation oracle can be used 
to decide optimality, and (3) any algorithm for decoding can be used as a 
separation oracle. □ 
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Corollary 4.4 Unless coNP=NP, [y^^'^ Optimality is not con- 
tained in NP. Unless P=coNP, there is no polynomial time algorithm for 

1. (3^"y^?^^y^)- Separation, and 

2. (3^'=y'=,-H'=y'=)- Decoding. 

Proof The result follows immediately from Theorem 14.21 and Proposi- 
tion 1131 □ 
Lastly, the decision problem (3^,^)— Optimal- Value 

^eR,hen,xeX,yey^{^ = max{/i(x, z) \ ze y}) 

i s also of interest. Fo llowing the proof that ExACT-TSP^ is D^-complete 
( Papadimitrioul . fl993 )^ . it can be shown that {y^^^, Ti^^^)- Optimal- Value 



is D^-complete. The significance of this result is that optimal-value can be 
reduced to separation and decoding, providing even stronger evidence that 
neither of these problems can be solved in polynomial time. 

Set Systems 

We now consider set systems (S,3^'^) with = {y E 2^ | 7r(y) = 1} 
where vr : 2^ — )• {±1}, and let : y ^ {0, 1}^ be the indicator function 
ipf{y) = 1 if e S y and otherwise. An independence system is a set system 
for which y' Q y £ y^ implies y' G y^, i.e., for which vr is hereditary. 
A particularly interesting family of independence systems corresponds to 
hereditary properties on the edges of a graph G = {V, E) where E <ZTi = 
\U C V I \U\ = 2}. In this case, decoding corresponds to minimum 
edge deletion problems, which are NP-hard for several important hereditary 
graph prope r ties li ke planar, outerplanar, bipartite graph, and many more 
(|Yannakakij . ll978l ). 



Proposition 4.5 With all constants and functions defined as above, {y^ , ip'^] 
Non-Optimality is in P if and only if the minimum edge deletion problem 
corresponding to the hereditary property n is in P. 

Proof ( =^ ) For the graph G = {V, E), let w he a. function mapping a 
set F <^ E to the adjacency matrix of (y,F). We give a Turing reduction 
from the minimum edge deletion problem to non-optimality in Algorithm [TJ 
To see this, it is sufficient to observe that: (i) the while loop continues 
as long as E contains a subset that is larger than \Y\ and satisfies vr; as 



ExACT-TSP is the problem of finding a Hamiltonian cycl e of a given lenth. 
^D^, introduced bv IPapa dimitriou and Yannakakij l| 19841 '). is the class of languages 
that are the intersection of a language in NP and a language in coNP, i.e., = {Li Ci 
L2 : Li e NP,L2 e coNP}. Note that this is not the same as NP n coNP, and that 
NP,coNP C D^. 
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Algorithm 1 An algorithm for solving the minimum edge deletion problem 
for some hereditary property tt given a (non-)optimality oracle. 
Require: Graph G = {V, E) 

Ensure: A maximum subgraph of G that satisfies tt 
1: Let y ^ 

2: while {y^ NON-OPTIMALITY(l(;(£^),y) do 

3: Let F ^ E; 

4: for e G £^ \ y do 

5: if {y^, ij)^)- Non-Optimality(u;(F \ {e}), Y) then 

6: F ^F\{e} 

7: end if 

8: end for 

9: for e e -B do 

10: if Non-Optimality( I^\^|~|^^]^I~^ u;(F \ y) + w{Y n F \ {e}), y) 

then 

11: F^F\{e} 
12: end if 
13: end for 

14: Let y ^ F 

15: end while 



(w(F), V'^(y)) increases in each iteration by one, the number of iterations 
is bounded from above by |{| {w{E), il^^{Z)) \ : Z e y^}] < \E\, (ii) the first 
for loop (lines 4-8) removes edges not contained in Y from F while ensuring 
that F contains a subset that is larger than \Y\ and satisfies tt; therefore 
F\y / 0; and because of hereditary max{ I A| : A C F A7r(A)} = |y| + 1, 
and (iii) the second for loop (lines 9-13) removes edges of Y from F while 
ensuring that F still contains a subset that is larger than \Y\ and satisfies 
TT. As the removal of these edges from the adjacency matrix (vj) will shrink 
{wjipiY)), the weight of the edges in F \ y is reduced accordingly. 

( <^= ) It remains to observe that to decide whether Y is non-optimal, it 
suffices to find a maximum structure and compare its size with Y. □ 

For properties tt that are non-hereditary, two modifications of the algo- 
rithm arise: (a) The constant in line 10 may be larger than 1; its precise 
value can be found by first increasing it from c = until ^{y^ ,4'^)— NON- 

Optimality (^ '^\^|~I^'|^'~^ ^(F \ y) + tv{Y n F), y) . (b) The structure 
F itself is not necessarily contained in 3^'^; a polynomial time subroutine for 
finding a A D F such that A G y^ is needed additionally. 
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4.3 Two New Assumptions 

We have thus far discussed the assumptions mabe by existing structured 
prediction algorithms to ensure efficient learning. We have also seen, in 
the form of hardness results, that these assumptions do not hold for several 
combinatorial structures thereby exposing the limitations of existing algo- 
rithms to learn efficiently to predict combinatorial structures. We are now 
ready to introduce two new assumptions, and provide several examples of 
combinatorial structures and applications in machine learning where these 
assumptions hold. These assumptions are based on counting and sampling 
combinatorial structures and will be elucidated in the following sections. 

4.3.1 The Counting Assumption 

The major difficulty in structured output learning is to handle the exponen- 
tially many constraints in the optimisation problem (j4.ip . While successively 
adding violated constraints is feasible under several assumptions, in the pre- 
vious section we discussed cases like route prediction where none of these 
assumptions hold. 

In the following, we will first show, considering again cyclic permuta- 
tions as a concrete example, that counting the number of super-structures 
can be feasible even if there are exponentially many of them and even if 
the assumptions of decoding, separation, and optimality do not hold. The 
counting assumption is stated as follows: 

Assumption 4.1 Denote by : y ^ the finite dimensional embedding 
of the output space y . It is possible to efficiently compute the quantities 



Route Prediction — Counting Cyclic Permutations 

For a given alphabet S, we are now interested in computing |3^'^^'^|, the 
number of cyclic permutations of subsets of S. For a subset of size i there 
are i\ permutations of which 2i represent the same cyclic permutation. That 
is, there are {i — l)!/2 cyclic permutations of each subset of size i, and for 
an alphabet of size = | S | there are 



different cyclic permutations of subsets of S. 

Computing '^^^^ is simple. For each pair of neighbours, there are N — 2 
remaining vertices, and for each subset of these of size i, there are 2(i -|- 1)! 



13^1, M/ = ^V(y), and C=Y,^{y)^^{y) . 
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permutations, of which 2{i + 1) represent the same cycUc permutation: 

i=0 ^ ^ 

where 1 is the vector of all ones. 

It remains to compute C'^^'^ . Each element of this matrix is computed as 

For |e n e'l > 0, we have 

W-leUe'l , , 

i=0 ^ ^ 

and for |e fl e'| = 0, we have 

i=0 ^ ^ 

Simple Set System Cases 

We first consider the general ^-label prediction problem where y = {Z <^ 
S I |Z1 = £} with ^-.y defined as i)i{Z) = \\ii^Z and otherwise. 

This setting generalises the usual multi-class {I = 1) and multi-label (by 
summing over all i < settings. For general £ G [SJ we have (with 

\n = d), 




As special cases we have for multi-class {y = T,) that ^ = 1, C = I, and 
for multi-label {y = 2^) that * = 2^^^-^!, C = 2l^l-2l + 2\^\-'^l. 

For both of these simple cases, exact decoding is very simple. For a 
given (test) instance x, let k G with Kj = kx{xi,x). For the multi- 
class case decoding is y = argmaXggj^[Q!K]e. For the multi-label case it is 
y = {e G S I [anje > 0}. Hence, we could in this case also apply separation 
based learning algorithms. 

Simple Non-Set System Cases 

We now consider poset regression. Let y C Tj, ip : y ^ and let (S, 
be a poset. With ■ipi{z) = 1 if z y i and ipi{z) = otherwise, we have 
'^i = \{k e T, \ k y i}\ and Cij = \{k e \ k y i A k j}\. As a 
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special case, we have the ordinal regression problem where y = T, = {dj 
(with d ordinal values), tp : y ^ with 'ipi{z) = 1 if z > i and il'i{z) = 
otherwise. In this case = |S| — i and Cij = |S| — max(z,j). Note that 
hierarchical classification is also a special case of poset regression where >- 
forms a directed tree. In both cases, decoding can be done by exhaustively 
testing only |S| alternatives. 



Permutations 

Let y be the set of permutations of S and let -0 • 3^ ~^ M^^^. Then 
|3^| = and with ipi^uv){^) = ^ if u v, ''P{uv)iz) = — 1 if u, and 
0(md)(-2) = otherwise, we have = 0, and 

ii u = v' Au' = V 
\i u = u' Av = v' 
\i u = u' xor V = v' 
if u = v' xor V = u' 
otherwise 



(uv){u'v') 



-|S|! 

3 

-|S|! 



The assumptions made in the literature are unlikely to hold in this case as 
the 'without any c ycle of length < t edge deletion problem is NP-complete 
(|Yannakakisl . llOTSl j for any fixed £ > A. 



Posets and Partial Tournaments 

Consider S = |A^] x |A^]; 3^ C 2^ such that ally e y form a poset (an acyclic 
directed graph closed under transitivity) ( [A^] ,y); as well as tf^ : y ^ M-^ 
with tpuvi^) = 1 if {u,v) E z, ipuviz) = —1 if iv,u) £ z, and ^puv{z) = 
otherwise. To the best of our knowledge, no exact way to compute C^^e' is 
known. However, we can relax 3^ to 3^ C 2^ such that all y G 3^ form a 
partial tournament (a directed graph with no cycles of length two). With 
rj = N{N - l)/2 we have 13^1 = S*?, ^' = 0, and 



{uv),{u'v') 



-2- 3'^!-^ ifu = v'Au' 

+2-3l''l^i ifti 

+2 • 3l''l"2 if u 

-2-3l''l-2 if-u 

otherwise 



V 



u Av = V 
u' xor V = v' 



v' xor V = u' 



The assumptions made in the literature are unlikely to hold in this case as 
the 't ransitive digraph' edge deletion problem is NP-complete ( Yannakakid . 
1978h . 
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Graph Prediction 

Consider graphs on a fixed set of vertices, that is S = {U C [A^| | |[/| = 2} 
and a property vr : 2^ — )• i7 such as acyclicity, treewidth bounded by a given 
constant, planarity, outerplanarity bounded by a constant, clique etc. Let 
and ^|J^ be defined as in Section [4.21 For properties like clique, we can 
compute y,xp,C: 

i=2 ^ ^ i=\ene'\ ^ ' ' 

For other properties, no way to compute C might be known but we can 
always relax 3^ to = 2^. We then have § = 2l^l-i and Ce e> = 21^1-1^^^'!. 



4.3.2 The Sampling Assumption 

The sampling assumption pertains to discriminative probabilistic models for 
structured prediction. The sampling assumption is stated as follows: 

Assumption 4.2 It is possible to sample efficiently from the output space 
y exactly and uniformly at random. 

We now describe three combinatorial structures with their corresponding 
application settings in machine learning. For each of these structures, we 
show how to obtain exact samples uniformly at random. 



Vertices of a hypercube 

The set of vertices of a hypercube is used as the output space i n mult i-label 
classification problems (see, for example, Elisseeff and Weston ( 200ll )). An 
exact sample can be obtained uniformly at random by generating a sequence 
(of length d, the number of labels) of bits where each bit is determined by 
tossing an unbiased coin. 



Permut at ions 



The set of permuta tions is used as the output space in label ranking problems 
(see, for example, Dekel et al. ( 20031 )). An exact sample can be obtained 
uniformly at random by generating a sequence (of length d, the number of 
labels) of integers where each integer is sampled uniformly from the set {dj 
without replacement. 



Subtrees of a tree 

Let T = {V, E) denote a directed, rooted tree with root r. Let T' denote 
a subtree of T rooted at r. Sampling such rooted subtrees from a rooted 
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tree finds applica tions in multi-catego r y hie rarchi cal classificat i on pro blems 
as considered by Cesa-Bianchi et al. ( 20061 ) and Rousu et al. (|200fil ). We 



now present a technique to generate exact samples of subtrees uniformly 
at random. The technique comprises two steps. First, we show how to 
count the number of subtrees in a tree. Next, we show how to use this 
counting procedure to sample subtrees uniformly at random. The second 
step is accomplished along the lines of a w ell-known reduction from uniform 
sampling to exact /approximate counting ( Jerrum et al. . 19861 ). 



First, we consider the counting problem. Let v € V he a vertex of T and 
denote its set of children by 5~^{v). Let f{v) denote the number of subtrees 
rooted at v. Now, / can be computed by using the following recursion: 

f(.r) = l+ n /W- (4-2) 

cG5+(r) 

Next, we consider the sampling problem. Note that any subtree can be 
represented by a d-dimensional vector in {0,1}'^, where d = \V\. A naive 
approach to generate samples uniformly at random would be the following: 
generate a sequence of d bits where each bit is determined by tossing an 
unbiased coin; accept this sequence if it is a subtree (which can be tested 
in polynomial time). Clearly, this sample has been generated uniformly at 
random from the set of all subtrees. Unfortunately, this naive approach will 
fail if the number of acceptances (subtrees) form only a small fraction of 
the total number of sequences which is 2*^, because the probability that we 
encounter a subtree may be very small. This problem can be rectified by a 
reduction from sampling to counting, which we describe in the sequel. 

We will use the term prefix to denote a subtree T' included by another 
subtree T" , both rooted at r. Let L{T') denote the set of leaves of T' . We 
will reuse the term prefix to also denote the corresponding bit representation 
of the induced subtree T'. The number of subtrees with T' as prefix can be 
computed using the recursive formula ()4.2p and is given (with a slight abuse 
of notation) by f{T') = (nDeL(T') /(^)) ~ \^{'^')\- Now, we can generate a 
sequence of d bits where each bit u with a prefix v is determined by tossing 
a biased coin with success probability f{u)/f{v) and is accepted only if it 
forms a tree with its prefix. The resulting sequence is an exact sample drawn 
uniformly at random from the set of all subtrees. 



4.4 Summary 

The purpose of this chapter was to highlight the limitations of existing 
structured prediction algorithms in the context of predicting combinato- 
rial structures. We have shown how even the weakest assumption made by 
these algorithms — Optimality — is coNP-complete for several classes of 
combinatorial structures. We then introduced two new assumptions based 



48 



Chapter 4. Complexity of Learning 



on counting and sampling combinatorial structures. We have also seen how 

these assumptions hold for several combinatorial structures and applications 
in machine learning. As we will sec in the next chapters, these two assump- 
tions will result in (i) the design of a new learning algorithm and (ii) a new 
analysis technique for discriminative probabilistic models for structured pre- 
diction. 



Chapter 5 

Structured Ridge Regression 



In this chapter, we design a new learning algorithm for structured predic- 
tion using the counting assumption introduced in the previous chapter. The 
algorithm, as we will see in the following sections, is a generalisation of ridge 
regression for structured prediction. The algorithm can be trained by solving 
an unconstrained, polynomially-sized quadratic program, and does not as- 
sume the existence of polynomial time algorithms for decoding, separation, 
or optimality. The crux of our approach lies in the polynomial time compu- 
tation of the vector ^ = Ylz^y'^i^) ^he matrix C = "^^ey i^) 
leading to a tractable optimisation problem for training structured predic- 
tion models. 



5.1 Ridge Regression 



Given a set of training examples {xi,yi), . . . Axm., Um.) £ x R, the g oal 
of regularised least squares regression fRLRS) (jRifkin and Lippertl . 120071 ) or 
ridge regression is to find a solution to the regression problem^ via Tikhonov 
regularisation in a reproduding kernel Hilbert space. The following optimi- 
sation problem is considered: 



r = argmin|||/f + , 



1=1 



(5.1) 



where A > is the regularisatio n parameter. According to the representor 
theorem (jScholkopf et al.l . l200ll ). the solution to this optimisation problem 
can be written as 

m 

f* = ^Cik{xi,-) 

i=l 



^The same technique can be used for bi nary classifica tion. The term regularised least 
squares classfication (RLSC) was coined bv lRifkinI l|2002ll . 
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for some c G and a positive definite kernel k : X x X ^ M on the input 
space. Using this fact, the optimisation problem (|5.ip can be rewritten as 

c* = axgmm ^c^ Kc + ^\\y — KcW^ . 

The optimal solution c* can be found by solving a system of linear equations 

{K + AI)c* = y . 



RifkinI (|20o3) discusses the pros and cons of regularised least squares 



regression in comparison with SVMs. Training an SVM requires solving a 
convex quadratic optimisation problem, whereas training an RLSR requires 
the solution of a single system of linear equations. However, the downside 
of training a non-linear RLSR is that it requires storing (O(m^) space) and 
also inverting (O(m^) time) the entire kernel matrix. Also, the solution of 
an RLSR is not sparse unlike the solution of an SVM thereby demanding 
huge computations at test time. In the case of linear RLSR, the optimisation 
problem (j5.ip can be written as 



{XX^ + XI)c* 



y 



where X is the input matrix of size m x n. If n ^ m, it is possible to solve 
this syst em in 0{mv? ) operations using the Sherman-Morrison- Woodbury 
formul a (iRifkinl. bood ) . Empirically, RLSC was shown to perform as well as 



SVMs (|Rifkinl . 
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5.2 Training Combinatorial Structures 

We now present a learning algorithm for predicting combinatorial structures. 
Interestingly, the connection to ridge regression is incidental that is estab- 
lished due to manipulating a particular structured loss function in such a 
way that the resulting optimisation problem remains tractable under the 
counting assumption. 

Problem Setting 

Let X C M" be an input space and y be the output space of a combinatorial 
structure. Given a set of training examples (xi, Yi), . . . , (x^, ^m) ^ X x 2-^, 
the goal is to learn a scoring function h : X xy that, for each Xj G Af, 
orders (ranks) y in such a way that it assigns a higher score to all y G 
than to sl]1 z ^ y \ Yi. Let ■0 : 3^ — ?■ M'^ be a finite dimensional embedding of 
y with the dot-product kernel ky = {'ip{y),ip{y')) . Let kx : X x X ^ Whe 
a kernel on X. Denote the joint scoring function on input-output pairs by 
h £ Ti = Tix'S'T-ly where (8) denotes the tensor product and TixiT'iy are the 
reproducing kernel Hilbert spaces (RKHS) of kx,ky, respectively. Note that 
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the reproducing kernel of V. is then k[{x, y), {x' , y')] = kx{x, x')ky{y, y'). We 
aim at solving the optimisation problem 

/i* = argminA||/i||2+ ^ e{h,i) , 

hen iG[m] ^ ' 

where i : Tlx |m] — t- M is the empirical risk on a training instance and A > 
is the regularisation parameter. 

Loss Functions 

For each Xi we aim at ordering all elements of Yi before all elements of 3^\li. 
Note that traditional (label) ranking methods cannot be applied due to the 
huge (exponential) size of 3^ \ 1^ . We use AUC-loss as the empirical error of 
the optimisation problem ()5.2p : 

Lnc{h,i)= J2 J2 cr[Hxi,z) - h{xi,y)] /t- on 

yeY, zey\Y, ^ ' 

where a is the modified step function: a{a) = +1 if a > 0, cr{a) = if a < 0, 
and o"(a) = 1/2 if a = 0. Our definition of £auc differs from the 'area under 
the ROC curve' measure only in the sense that it is not normalised. To 
obtain a convex function we bound it from above by the exponential loss 

4xp(/i,«)=I] Yl exp [1 + /i(xi,z) - /i(xi,2/)] > 4uc(/i,«) ■ (r,A\ 

yeY,zey\Y, ' 

Despite being convex the exponential loss does not allow compact formula- 
tions in our setting, but using its second order Taylor expansion at 0, i.e., 
exp(a) w 1 + a + ^a^, does. Ignoring constants that can be accommodated 
by the regularisation parameter, we get 

^{h,i) = Y Y [hixi,z) - h{xi,y) + ^h'^{xi,z) 

y£Y, zey\Y, (5.5) 
-h{xi, z)h{xi,y) + ^h'^{xi,y)] . 

The above loss function can be seen as a generalisation of the square 
loss for structured prediction, and hence the connection to ridge regression 
is established. Similar loss func tions were conside red in previous work on 
structured prediction problems. lAltun et al. introduced the ranking 



loss ()5.3p for structured prediction and minimised an upper bound given 
by the exponential loss (|5.4p for discriminative sequence labeling. For this 
specific problem, dynamic programming was used to exp l icity c ompute the 



sums over all possibls sequences efficiently (|Altun et al.l . |2002| ) . A closely 



related loss functio n to our approximation (15.51) i s the one minimised by 



least-squares SVM (ISuvkens and Vandewalld . Il999l ) and also its multi-class 
extension ( Suvkens . 19991 ). Our approach can therefore be seen as an ex- 



tension of least-squares SVM for structured prediction problems. The main 
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reason behind deviating from the standard max-margin hinge loss is to make 
the problem tractable. As will become clear in the following sections, using 
the loss function (jS.Sp results in a polynomially-sized unconstrained optimi- 
sation problem. 



Representer 

The standard representer theorem ( Scholkopf et al. . 200ll ) states that there 



is a minimiser h* of the optimisation problem (|5.2p with 

h* = span{fc[(xi, z), (•, ■)] \ i e {m}, z e y] . 

It also holds in our case: Without loss of generality we consider h = f + g 
where f & F and g £ H with g-LF. Now h{xi, z) = {h{-, •), k[{xi, z), (•, •)]) = 
{f{;-),k[{x,,z), {;■)]) +0 = f{x,,z) as Well as \\h{;-)f = \\f{;-)f + 
\\g{-, ■)\\'^ . This shows that for each h £ Ti there is an / G for which 
the objective function of (j5.2|) is no larger. 

As usually |3^| grows exponentially with the input of our learning al- 
gorithm, it is intractable to optimise over functions in F directly. Let 
ei,. . . ,ed be the canonical orthonormal bases of M*^. We then have that 
{kx{xi,-)(g){ei,.) I i G |m],/ G {dj} spans {kx{xi, ■) (E> {ip{z), .) \ i G |m],z G 
3^}. Therefore it is sufficient to optimise over only md variables. We hence 
consider 

a* = argmin A||/„f + ^ e{f^,i) (5.6) 



with 



fa{x,z)= OLiikx{Xi,x){eull}{z)) 



Optimisation 

We now exploit the fact that the vector ^' = X^zey V'l-^^) ^-iid the matrix 
C = '^zey V'(-2^)V'~''(-2) can be computed in polynomial time for a wide range 
of combinatorial structures as shown in Section 14. 3[ li ijj : y ^ can be 
computed in polynomial time but computing C, ^ is hard, we may resort 
to approximations to C and ^. Another option would be to relax the set 
y to & superset y such that C, ^' can be computed in polynomial time for 
y. For set systems we can (in the worst case) relax to y = 2^ such that 
Cafi = 2l^l-l"U^I with ky{z,z') = \zr\z'\. 

Denote fa{xi,-) = Eie[ml,ieH /«• ^et Y be the 

matrix Y G W^^'^ such that 1^. = Yliy^Y V'^(y) K be the kernel matrix 
such that Kij = kx{xi,Xj). We then have = aKi. and = Yi.f^. We 
can now express i{fa,i) using C, and 

E/(^-^) = (/-^> ' 



5.2. Training Combinatorial Structures 



53 



^ f{xi,z) = ^f{xi,z)-^f{xi,y), 

zayXYi zay y&i 

For the simple setting where \Yi\ = 1, for all i E \m\, we have 

We have thus expressed ^(/a, i) explicity in terms of ^ and C, and comput- 
ing these quantities will depend on the specific combinatorial structure (cf. 
Section [13]). 

Let o denote the Hadamard product, let tr denote the trace operator, 
and let diag be the operator that maps a square matrix to the column vector 
corresponding to its diagonal as well as a column vector to the corresponding 
diagonal matrix. Using the canonical orthonormal basis of W^, we can write 
the optimisation problem (|5.6|) as 

argmin XtraKcx^ + -tYKa^Ca.K + ^^OLKl + ^ ||diag(yQ:K)f 
- 13^1 tvYcxK - ^""cxKdi&^iYcxK) . 

(5.7) 

We can use iterative methods like Newton conjugate gradient for training 
with the gradient 

2XaK + CaK'^ + ^l'^ K - diagi"^^ a K)K 

-\y\Y'^K + {\y\Y^ - o YaK)K 

and the product of the Hessian with vector v 

2XvK + CvK"^ + |3^|y^(I o YvK)K - ^diSig{YvK)K - Y"^ d\ag(^!"^ vK)K . 

Computing ^(/q,?) with Infinite Dimensional Output Embedding 

In the case of infinite dimensional output embeddings, we assume that 
span{A;y(z, •) | z G y} has a basis lii(-), . . . Ufc(-) with k polynomial in 
the input of our learning problem. Therefore it is sufficient to optimise over 
mk variables as spa.n{kx{xi, ■) ui{-) \ i G lml,l G [A;]} = T resuting in 
the following optimisation problem: 

a* = argmin A||/a|p + ^ i{fot,i) 
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with 

fc{x,z)= ^ aiikxixi,x)ui{z) . 

ielmliem 

We now introduce bra-ket notation for convenience and denote ky{z,-) 
by \z) and its dual by {z\. General elements of Tiy and its dual will be 
denoted by kets and bras with letters other than z and y. Note that hence 
I a) € Tiy does not imply that there exists z £ y with \a) = \z) . In particular, 
we denote fa{xi,-) = EjG[ml,«G[fel ^jO^KO by The product 

\a) {b\ is a linear operator Tiy — )• Tiy, the product {b\ \a) is just the inner 
product (a, 6). Note that for ip : y ^ with {4>{z),(j){z')) = ky{z,z') 
it holds that \z) = ip{z) and {z\ = With |^) = J2zey\^)^ ^ = 

T^zey 1^) (^1' and = Y.yeY, fi^uy) we obtain 

5]/2(x„z) = (/^|C|/^> , 
z^y 

and hence 

^{fc.,^) = \ {fl\C |/A> + {fl\ m - \y\Fl^ - Fl {{fl\ \^) - Mf^) . 



5.3 Scalability Issues 

The optimisation problem ()5.7p suffers from scalability issues similar to those 
that arise in ridge regression. Also, the gradient and Hessian computations 
involve dense matrix-matrix and matrix- vector computations that may prove 
to be detrimental for use in large-scale structured prediction problems. We 
now present a couple of techniques to address these issues. First, we refor- 
mulate the problem using a linear scoring function. This is not a serious 
restriction as we will see in the following section that it is indeed possible 
to solve an equivalent problem to ()5.7p using linear models and techniques 
from low-dimensional embeddings. Second, we show how to solve the prob- 
lem ()5.7p using online optimisatio n and RKHS update s very much similar 
in spirit to the kernel perceptron ( Freund and Schapire . 1999l l. 



5.3.1 Linear models 

Consider a linear model with scoring function f{x,y) = {w,(j){x,y)) where 
(p{x,y) = ip{y) ® X, and the following problem: 

argmin A||^i;|p + ^ i{f,i) , (5.8) 
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where the loss function is the same as (|5.5|) . If we let t(j : y — t- {0, l}'^, we 
can again express £{f, i) using C, and 



E fixi,z) 



{w, "3/ (g) Xj 



E f ixi,z) = w^lY^ (l){Xi,z)(j)^{Xi,z)]w 



Under the assumption that Ti = T-Lx ® T-Ly, the optimisation problem (|5.8p 
is equivalent to (|5.7p with kx{x,x') = {x,x'). Given any arbitrary kernel 
on the input space X and a set of training examples, we can extract a cor- 
responding low-dimensional embedding of the inputs (see Appendix |X] for 
an overview on kernel functions and low-dimensional mappings) and still be 
able to solve the problem (|5.8p . The advantage of such a formulation is that 
it is straightforward to apply online optimisat i on techniques like stochastic 



gradient descent a. i id its extensions (IZinkevichl . l2003l : IShalev-Shwartz et al 



20071 : iHazan et all , hooj ) resulting in scalable algorithms for structured pre- 
diction. 



5.3.2 Online Optimisation 

Online methods like stochasti c grad ient descent (SGD) and stochastic meta 
descent (SMD) dSchraudolphl . ES) incrementally update their hypothesis 
using an approximation of the true gradient computed from a single training 
example (or a set of them). While this approximation leads to slower con- 
vergence rates (measured in number of iterations) when compared to batch 
methods, a common justification to using these methods is that the computa- 
tional cost of each iteration is low as there is no need to go through the entire 
data set in order to take a descent step. Consequently, stochastic methods 
outperform their batch coun terparts on redundant, non-stationary data sets 
( Vishwanathan et al. . 20061 ) . The central issue in using stochastic methods 
is choosing an appropriate s tep size of the gradient descent, and techniques 



like stochastic meta descent ( Schraudolph . 1999 : Vishwanathan et al. . 20061 ) 



have emerged as powerful means to address this issue. 



Stochastic Gradient Descent in Feature Space 

It is rather straightforward to use stochastic gradient descent for linear 
models with a parameter vector w. The update rule at any iteration is 
w ^ w — rjV, where V and rj are the instantaneous gradient and step size^ 
respectively. However, for non-linear models such as kernel methods, we 
have to perform gradient descent in RKHS and update the dual parameters. 
We illustrate this with an example following the online SVM algorithm of 



^The step size is often chosen to be time-dependent. 
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Kivinen et alj (|200lh . Consider regularised risk minimisation with loss func- 



tion £:Xxyxy 



R{f) = Xn{f) + ^j:e{x„y„f{x^)) . 



m 

i=l 



The stochastic approximation of the above functional is given as 

Rstoc{f,t) = \n{f)+£{xt,ytJ{xt)) . 

The gradient of Rstocif,t) w.r.t. / is 

V fRstocif, t) = XV Mf) + V fiixt, yt, f{xt)) 

= Wf^if) + e'{xt, yt, f{xt))kx{xu •) . 

The second summand follows by using the reproducing property of T-ix to 
compute the derivate of /(x), i.e., V/ (/(•), kx{x, •)) = kx{x, •), and there- 
fore for the loss function which is differentiable in its third argument we ob- 
tain Vfi{x, y, f{x)) = i'{x, y, f{x))kx{x, •). The update rule is then given as 
f <— f — r]V Rstoc{f,t)- For the commonly used regulariser 0.{f) = 
we get 

/ ^ / - VtVtiXf + i'{xt,yt, f{xt))kx{xu •)) 
= (1 - \rit)f - 7]ti'{xt,yt, f{xt))kx{xt, •) . 

Expressing / as a kernel expansion /(•) = Cikx{xi, •), where the expan- 
sion is over the examples seen until the current iteration, we get 

ct ^ (1 - \rf)ct - vd'ixt, yt, f{xt)) 

= r]i'{xt, yt, f{xt)) for q = 

Ci = (1 — A??)cj for z 7^ t . 

We are now ready to derive SGD updates for the optimisation problem 

(ET 



Online Structured Ridge Regression 

At iteration t, let Kt = i^[t]p] G M*^*, kt = {Kiqiti).t, Vt = Yt., and let 
cxt € M'^^* be the parameter matrix. In the following, we omit the subscript 
t. The instantaneous objective of the optimisation problem ()5.7p can be 
written as ^ 

argmin A tr cti^a^ H — k"^ cx^ Cock + '^"^ (xk 
-F^(yaA;)^ - \y\ycxk - '^^ cxkyak 

with gradient V, 

2XcxK + Ccxkk^ + ^!k^ + \y\ycxky^k^ 
-\y\v^k^ - "^k^yak - ^'^aky^'k^ . 



(5.9) 



(5.10) 



5.4. Approximate Inference 
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product of Hessian with vector v, 

2\vK + Cvkk^ + \y\yvky^k^ - -^k^yvk - vky^ . (5.11) 

With step size r/, we obtain the following update rule: cx cx — rjSJ . We 
now discuss a couple of implementation aspects pertaining to the optimisa- 
tion problem ()5.9p . 



Truncating Kernel Expansion Coefficients 



As the function ()5.9p is minimised in RKHS, the parameter matrix a grows 
incrementally with time by adding a single row in every iteration. In order 
to speed up computations, we truncate all parameters that were updated 
before time r. This is justified for regularised risk minimisation problems 
because at every iteration, a,, with i <t \s shrunk by a factor (1 — Xrj) and 
therefore the contribution of old parameters in the con i putation of the kernel 
expa nsion decreases with time (jKivinen et al.l . l200ll : IVishwanathan et al.l . 
We use this simple technique to speed up computations in our ex- 



periments. I t is al so possible to apply other techniques (see, for example, 
(jPekel et al.l . l2008l )). to discard less important coefficients. Note that the 
learning rate is set to ensure < 1 — A?7 < 1. 



Step Size Adaptation 



The step size plays an important role in the convergence of st ochastic approx- 



imati on techniques and has been the focus of recent research (jVishwanathan et al 
20061 ) . We set the step size rjt 



^ where p is a parameter that has to be fine 



tuned to obtain good perfo rmance. A be t ter way to set the step size would be 



to consider SMD updates ( Schraudolph . 1999 : Vishwanathan et al. . 20061 ) 



5.4 Approximate Inference 

Thus far we have described a training algorithm for structured predic- 
tion, but have not discussed how to predict structures using the learned 
model. We now design (inference) algorithms for predicting combinato- 
rial structures. As exact inference is in general hard (cf. Chapter U]), we 
have to resort to approximations. We therefor e design approximat i on al- 



gorithms using the notion of z-approxirn ation (jHassin and Khullerl . 12001 



Ausiello and Marchetti-Spaccamela . 1980l ) with provable guarantees. 



5.4.1 Approximate Decoding 

In order to construct (predict) structures for a given input using the model 
described in Section 15. 2| we have to solve the decoding problem 

argmax/(x,y) . 
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In cases where exact decoding is not possible, we resort to approximate 
decoding methods and aim at finding y such that 

f{x,y) ^ max/(x,y) . 



z-approximation 

^-approximation algorithms are particularly suitable for optimisation prob- 



l ems i nvolving negative weights. The reader is referred tolHassin and Khuller 



(|200ll ) for examples, z-approximation was proposed bv lZemell (| 19811 ) instead 
of the more common "percentage-error" f (x , y) / maxy^y f (x , y) . y has z- 
approximation factor u for a maximisation problem if 

f{x, y) > (1 - u) max f{x, y) + v min /(x, y) . (5.12) 
An algorithm with u = is optimal whereas ar i algorithm with v = 1 is 



trivial, z-approximation has severa l advantages (i Hassin and Khullerl . 12001 



Ausiello and Marchetti-Spaccamela . 1980l ) including (z) moKy^y f {x , y) has 



the same z-approximation as niaxy^y{f{x,y) + c) where c is a constant; 
(a) maxy^y f(x,y) has the same z-approximation as minygy(— /(x, y)) (z- 
approximation for minimisation is defined by exchanging min and max in 
()5.12p ): and (iii) the z-approximation factor is invariant under replacing 
binary variables by their complement. 



Decoding Sibling Systems 

A sibling system is an output space y with a sibling function r : y ^ y 
and an output map ■0 such that \/z £ y : tp{z) + 'ip{r{z)) = c as well as 
yz £ y : (c, ^p{z)) = with fixed c. In other words, it is an output space in 
which each structure can be complemented by its sibling. 

Proposition 5.1 There is a 1/2-factor z-approximation algorithm for de- 
coding sibling systems. 

Proof Choose some y G y at random. If f{x, y) > 0, then y = y; otherwise 
y = r[y). This completes the description of the algorithm. 

We know that f{x,y) = {wx,ip{y)) for some Wx G M'^. Now Vy £ 
y ■ f{x,y) + f{x,r{y)) = {wx^Tpiz)) + (wx^ipiriz))) = {wx,c) = and 
as miiiy^y f{x, y) = — max^gy f{x, y), we have 

f{x,y) > = I- max f{x,y) + ^min/(x,y) 
^ y&y I yey 

thereby satisfying ()5.12p with u = 1/2. □ 
Notice that if r is bijective and c = 0, then also ^ = 0, thus significantly 
simplifying the optimisation problem (j5.7p . For 3^ = 2^ an example of a 
bijective sibling function is r : y i— )■ S \ y. 
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Decoding Independence Systems 

An independence system (S,3^) is an output space y such that yy £ y : 
z C y ^ z £ y and membership in y can be tested in polynomial time. 
Consider an output map : y ^ M'^I with ipu{z) = \/ fJ-{u) if n G z and 
tpui^) = otherwise, for some measure fj,. Here we find a polynomial time 
1 — (log2 |S|)/|S| factor z- approximation following ( Halldorssonl . 200d ). 



Proposition 5.2 There is a 1 — log2 factor z- approximation algo- 

rithm for decoding independence systems. 

Proof Partition S into log2 many subsets Ei of size \Ei\ < 

[log2 . Choose y = argmax^g^ f{x, y) where S = {avgma^y^Einy fi^^ v) I 
i G [[|S|/log2 We again choose Wx such that f{x,y) = {wx,tp{y)). 

Now we can find y = argmaXy^g {wx,ip{yi)) in polynomial time by ex- 
haustively testing 21-^'! < 2r^°S2lsn < |$]| + 1 alternatives in each of the 
log2 < |S| subsets. This completes the description of the algo- 
rithm. 

For the z-approximation, suppose there was y' € y with f{x,y') > 
/(x, log2 As {Ei} partitions S, we have 

f{x,y') = {wx,'ip{y')) 

i 

< rrn" ™ax iwx.il'iy' n Ei)) 

log2|S| I 
l0g2 |S| 

contradicting the assumption. Together with m.va.y^y f{x,y) < /(j;,0) = 
this proves the stated approximation guarantee. □ 



5.4.2 Approximate Enumeration 

If the non-optimality decision problem 3y £ y : f{x, y) > 9 is NP-hard then 
there is no algorithm for enumerating any set S'^ D S* = {y £ y \ f{x, y) > 9} 
of cardinality polynomial in in output polynomial time (unless P=NP). 
To see this, suppose there was an output polynomial algorithm for listing 5^. 
This algorithm can be used to obtain an output polynomial algorithm for 
listing S* with additional complexity only for testing s £ S* for all s £ S*. 
Now if the algorithm terminates in input polynomial time then it is sufficient 
to check the cardinality of S* to decide non-optimality. If on the other hand 
the algorithm does not terminate in input polynomial time, then S* can not 
be empty. 
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Hence, we consider enumerating approximate solutions, i.e., we want to 

list 

\y \ fix,y)>il-'^)m&xf{x,y) + ummf{x,y) \ . (5.13) 

If 3^ is such that listing the whole of y is hard, this precludes a general 
algorithm for the trivial case (i^ = 1) and makes it rather unlikely that an 
algorithm for > exists. We now assume that we know how to list (sample 
uniformly from) y and that the situation is similar to sibling systems: We 
have a bijective function r : y ^ y and a map ip such that \/z £ y : 
'ip{z) + jp(r{z)) = c as well as Vz G 3^ : {c,ijj{z)) = for fixed c. Then we 
can list the set (|5.13p in incremental polynomial time by a simple algorithm 
that internally lists y from y but instead of outputting y directly, it only 
outputs y if f{x,y) > f{x,r{y)) and otherwise outputs r{y). 



5.5 Empirical Results 



In all experiments, we fixed the regularisation parameter of our algorithm 

^ = \y\-i:^eiH\^i\/^- 



Multi-label Classification 



We compar ed the performance of our algorithm with the kernel method 
proposed by Elisseeff and WestonI ( 200ll ) that directly minimises the ranking 
loss (number of pairwise disagreements). We fo llowed the same exper i menta l 
set up (dataset and kernels) as described in ( Elisseeff and WestonI . boOll ). 
We used the Yeast dataset consisting of 1500 (training) and 917 (test) genes 
with 14 labels, and trained our algorithm with polynomial kernel of varying 
degree (2-9). Figure ETT] shows the results for Hamming loss and ranking 
loss. 



Hierarchical Classification 

We trained our algorithm on the WIPO-alpha patent dataset^ consisting of 
1352 training and 358 test documents. The number of nodes in the hierarchy 
is 188 with maximum depth of 3. Each document belongs to exactly one 
leaf category and hence contains no mult iple paths. The pe rformance of 
several algorithms (results are taken from ( Rousu et al. . 20051 )) is shown in 



Table ISTTl ^g/i denotes the zero-one loss in percentage, is the average 
Hamming loss per instance, and in is the hierarchical loss (average per 
instance). The hierarchical loss is based on the intuition that if a mistake 
is made at node i, then further mistakes made in the subtree rooted at 



^Available at one of the authors (|Rousu et al.l . |2006| ) webpage: 
http:/ /users. ecs.soton.ac.uk/cjs/downloads/ 
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Figure 5.1: Comparison of multi-label SVM (MLSVM) and our algorithm 
(CSOP) on multi-label classification. 

Table 5.1: Comparison of various algorithms on hierarchical classification. 



Algorithm 


^0/1 


(a 




SVM 


87.2 


1.84 


0.053 


H-SVM 


76.2 


1.74 


0.051 


H-RLS 


72.1 


1.69 


0.050 


H-M3 - 


70.9 


1.67 


0.050 


H-M^ -^^ 


65.0 


1.73 


0.048 


CSOP 


51.1 


1.84 


0.046 



i are unimportant ( Cesa-Bianchi et al. . 20061 ). Formally, for any pair of 
hierarchical labels z and y, 

iHiz,y) = Y^imz) / ^l^i{y)^^,{z) = V^,(y), j G ANCii)) , 

i=l 

where ANC{i) is the set of ancestors of i, and I{p) is 1 if p is true and 
otherwise. In this way the hierarchy or taxonomy of the problem do- 
main is taking into account. SVM denotes an SVM trained for each micro- 
label independently, H-SVM denotes an SVM trained for each microlabel 
independently and using only those samples for which the ancestor labels 
are positive, H-RLS is the hierarchical least squares algorithm described in 
(ICesa-Bianchi et al.l.l2006l ^ and H-M^ is the kernel-based algorithm proposed 
by IRousu et al.l (|2006l ) which uses the maximum margin Markov network 
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framework. The two different versions of this algorithm correspond to using 
the Hamming loss and the hierarcical loss during training. While the per- 
formance on the Hamming loss was comparable to the baseline SVM, our 
algorithm resulted in best performance on the ^q/i loss and the hierarchical 
loss. 



Dicycle policy estimation 

We experimented with an artificial setting due to lack of real world data sets 
on dicycle prediction^. We simulate the problem of predicting the cyclic tour 
of different people. We assume that there is a hidden policy for each person 
and he/she takes the route that (approximately) maximises the reward of 
the route. In the setting of dicycle prediction, the learned function is linear 
in the output space = and for testing we can check 

how well the estimated policy approximates the hidden policy in the 
test (i E |m + l,m']) set. The data is constructed as follows: (i) generate 
n matrices A^^^ £ M^^^ uniformly at random with entries in the interval 
[—1, 1] and Aul = —A^l; (ii) generate (m + m')n random numbers uniformly 
between and 1 to form the inputs Xi G M"; (iii) create the output structures 

yi « aT:gmaXy(,yJ2{u,v)ey,jeln}^ij^uv for training, that is i G In}. On the 
test set, we evaluated our algorithm by cosine similarity of the learned policy 
and the true policy: 



je[m,m+m'] \ / 



1-1 



Figure 15.21 shows a plot of the cosine measure on an experiment {m' = 
500,71 = 15, |S| = 10) for varying number of training instances. As ex- 
pected, we see that our algorithm is able to estimate the true policy with 
increasing accuracy as the number of training instances increases. The plot 
also shows the performance of structured SVM using approximate decoding 
during training. Approximate decoding is performed by randomly sampling 
a couple of cyclic permutations and using the best scoring one (the number 
of repetitions used in our experiments was 25). The results were unstable 
due to approximate decoding and the results shown on the plot are averages 
from 5 trials. For structured SVM, we experimented with several values 
of the regularisation parameter and report the best results obtained on the 
test set — though this procedure gives an advantage to structured SVM, 
our algorithm still resulted in better performance. 



''Note that we considered directed cyclic permutations as opposed to undirected ones 
described in Section 14.21 due to the fact that ^I''^^'^ = for dicycles, thereby reducing the 
number of computations involved in optimising (|5.7p . See Appendix |B] for more details on 
counting dicyclic permutations. 
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Figure 5.2: Comparison of structured SVM and CSOP on dicycle policy 
estimation. 



Stochastic Gradient Descent 

We performed a couple of experiments using artificially generated data to 
compare the performances of online and batch learning. In the first ex- 
periment, we trained a simple multi-label classification model to learn the 
identity function / : {0, 1}'* {0, l}'^. The goal was to compare batch 
and online learning, using Newton conjugate gradient (NCG) and stochastic 
gradient descent (SGD) respectively, in terms of the final objective value of 
the optimisation problem (|5.7|) and training time. We also studied the ef- 
fects of the truncation parameter r on speed and final objective. We trained 
SGD on a single pass of the data set. Figures 15.31 and 15.41 summarises the 
results for multi-label classification on an artificial data set with 5 features 
and labels. We set the truncation parameter r to 0.15 x m, where m is 
the number of training instances. We see that the final solution of SGD is 
comparable to that of NCG, and the speed up achieved by SGD is apparent 
for large data sets. The effect of r on training time and objective is shown 
in Figure 15. 4i The training time of SGD increases with r and attains the 
training time of NCG at around 19% of m. Beyond this value, we found 
that SGD was taking longer time than NCG. This underlines the effect of r 
when performing SGD updates in RKHS. 

In our second experiment, we considered dicycle policy estimation as 
before. We trained SGD and NCG on data sets of varying size from 100 
to 5000 with n = 15 and T, = 10. We fixed r to 500 kernel expansion 
coefficients. Figure 1^31 shows a plot of final objective versus training time of 
SGD and NCG on the different data sets. The plot shows that NCG takes a 
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Figure 5.3: Comparison of SGD and NCG training on multi-label classifica- 
tion. 
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Figure 5.4: Effect of truncation parameter r on training time and final 
objective value on multi-label classification with SGD and NCG. 



much longer time to attain the same final objective value as SGD. Note that 
as we perform single pass training, with fixed amounts of training instances 
NCG attains a smaller value of the objective function than SGD. However, 
as SGD can deal with much more training instances in the same time, after a 
fixed amount of time, SGD attains a smaller value of the objective function 
than NCG. 
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Figure 5.5: Comparison of SGD and NCG training on dicycle policy esti- 
mation. 

5.6 Summary 

We presented a learning algorithm for predicting combinatorial structures 
under the counting assumption introduced in Chapter U] . Our approach 
subsumes several machine learning problems including multi-class, multi- 
label and hierarchical classification, and can be used for training complex 
combinatorial structures. As for most combinatorial structures considered 
in this paper the inference problem is hard, an important aspect of our 
approach is that it obviates the need to use inference algorithms, be it exact 
or approximate, for training. 

We have also seen how to train a linear model using the counting as- 
sumption. Under some reasonable assumptions, a non-linear model can be 
approximated using a linear model using techniques from metric embedding 
theory. Furthermore, we addressed the scalability issues of our approach 
by presenting an online learning algorithm using stochastic gradient descent 
that can update model parameters (kernel expansion coefficients) in RKHS. 

For prediction, inference can naturally not be avoided. Therefore, we 
have to rely on approximation algorithms described in Section [5.41 We note 
that it is non-trivial to design approximation algorithms for the decoding 
problem of the combinatorial structures considered in this work. Indeed, 



(Guruswami et al. 


. 20081 


(Biorklund et al.. 


2004). 



directed cycles 

While we have seen how to minimise a squared loss function, it would 
be interesting to train a probabilistic model by minimising the negative log- 
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likelihood a la conditional random fields (cf . Section 12. 2|) . This will be the 
focus of the next chapter. 



Chapter 6 

Probabilistic Structured 
Prediction 

Maximum a posteriori (MAP) estimation with exponential family models 
is a fundamental statistical technique for designing probabilistic classifiers 
(cf. logistic regression). In this chapter, we consider MAP estimators for 
structured prediction using the sampling assumption introduced in Chap- 
ter HI i.e., we concentrate on the case that efficient algorithms for uniform 
sampling from the output space exist. We show that under this assumption 

(i) exact computation of the partition function remains a hard problem, and 

(ii) the partition function and the gradient of the log partition function can 
be approximated efficiently. The main result of this chapter is an approx- 
imation scheme for the partition function based on Markov chain Monte 
Carlo theory. We also design a Markov chain that can be used to sample 
combinatorial structures from exponential family distributions given that 
there exists an exact uniform sampler, and also perform a non- asymptotic 
analysis of its mixing time. 

6.1 Probabilistic Models and Exponential Fami- 
lies 

Let Xxyhe the domain of observations and labels, and X = (xi, . . . , Xm) G ^ 
Y = {yi, . . . ,ym) G y"^ be the set of observations. Our goal is to estimate 
y I X using exponential families via 

p{y \ x,w) = exp(((/)(x, y),'w) — In Z{'w \ x)) , 

where cp^x, y) are the joint sufficient statistics of x and y, and Z{w \ x) = 
^^g-y exp{{(j){x, y),w)) is the partition function. We perform MAP parame- 
ter estimation by imposing a Gaussian prior on w. This leads to optimising 
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the negative joint likelihood in w and Y: 
w = argmin [— In p{w, Y \ X)] 



argmin 

w 



X\\w\\'^ + — ^[In Z{w I Xi) - {(j){xi, yi),w)] 



(6.1) 



where A > is the regularisation parameter. We assume that the £2 norm of 
the sufficient statistics and the parameters are bounded, i.e., ||0(x,?/)|| < R 
and \\w\\ < B, where R and B are constants. Note that it is possible to 
upper bound the norm of the parameter vector w as shown below. 

Proposition 6.1 The norm of the optimal parameter vector w is bounded 
from above as follows: 

MI<V^- 

Proof Consider any (x, y) £ x y. Denote by £{w, x, y) the loss function, 
where i{w^x^y) = In Z{w \ x) — {(l){x , y) , w) > 0, and note that i{0,x,y) = 
ln|3^|. Let F{'w) = — lnp{w,Y \ X). The true regularised risk w.r.t. w and 
an underlying joint distribution D on X x y is 

E(^,^y^^D[i{w,x,y)]+X\\wf < F{0) = ln\y\ . 

This implies that the optimal solution w of the above optimisation problem 
lies in the set {w : ||w|| < y^ln IJ'I/A}. □ 
The difficulty in solving the optimisation problem (j6.ip lies in the com- 
putation of the partition function. The optimisation is typically performed 
using gradient descent techniques and advancements thereof. We therefore 
also need to compute the gradient of the log partition function, which is 
the first order moment of the sufficient statistics, i.e., Vmln Z(w \ x) = 

^y^p(y\x,w)[H^,y)]- 

Computing the log partition function and its gradient are in general NP- 
hard. In Section 16. 2t we will show that computing the partition function 
still remains NP-hard given a uniform sampler for y. We therefore need to 
resort to approximation techniques to compute these quantities. Unfortu- 
nately, application of concentration inequalities do not yield approximation 
guarantees with polynomial sample size. We present Markov chain Monte 
Carlo (MCMC) based approximations for computing the partition function 
and the gradient of the log partition function with provable guarantees. 
There has been a lot of work in applying Monte Carlo algorithms using 
Markov chain simulations to solve #P-complete counting and NP-hard op- 
timisation problems. Recent developments include a set of mathematical 
tools for analysing the rates of convergence of M arkov chains to equilibrium 
(see ( Randall . 20031 : Jerrum and Sinclair . 19961 ) for surveys). To the best of 



our knowledge, these tools have not been applied in the design and analysis 
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of structured pred iction probleni s , but have been referred to as an important 
research frontier Undri.. etall B for MCMC based machine learnine 
problems in general. 



6.2 Hardness of Computing the Partition Func- 
tion 

We begin with a hardness result for computing the partition function. Con- 
sider the following problem: 

Definition 6.1 Partition.- For a class of output structures y over an 
alphabet Ti, an input structure x £ X, a polynomial time computable map 
Tp : y ^ M'^, and a parameter w, compute the partition function Z{w). 



We now show that no algorithm can efficiently solve Partition on the class 
of problems for which an efficient uniform sampling algorithm exists. To 
show this, we suppose such an algorithm existed, consider a particular class 
of structures, and show that the algorithm could then be used to solve an 
NP-hard decision problem. We use that (a) cyclic permutations of subsets 
of the alphabet S can be sampled uniformly at random in time polynomial 
in and (b) there is no efficient algorithm for Partition for the set of 
cyclic permutations of subsets of the alphabet S with ipuviu) = = 1 if |u, v} £ y 
and otherwise. Here (a) follows from Sattolo's algorithm ( Sattold . Il986l ) 



to generate a random cyclic permutation. To prove (b), we show that by 
applying such an algorithm to a multiple of the adjacency matrix of an 
arbitrary graph and comparing the result with |Sp we could decide if the 
graph has a Hamiltonian cycle or not. 



Theorem 6.2 Unless P=NP, there is no efficient algorithm for Partition 
on the class of problems for which we can efficiently sample output structures 
uniformly at random. 

Proof Consider the output space of undirected cycles over a fixed set of 
vertices S, i.e., 3^ = IJ^^^ cyclic_permutations(C/), for an input x £ X. Let 
Tp : y ^ M^^^ with Tpuviy) = 1 if {n, f } G y and otherwise. 

Suppose we can compute \iiZ{w) = In ^^^-y exp( (■;/'(?/), tw)) efficiently. 
Given an arbitrary graph G = iV^E) with adjacency matrix u), let S = y 
and w = id X ln{\V\l x \V\). We will show that G has a Hamiltonian cycle 
if and only iflnZ{w) > \V\ x ln(|y|! x \V\). 

Necessity: As the exponential function is positive and In is monotone 
increasing, it follows that In Z{w) > \V\ x ln(|y|! x \V\). 
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Sufficiency: First, observe that 13^1 < |F|! x Suppose G has no 

Hamiltonian cycle. Then 

In Z{w) < \n[\y\ X exp[(|y| - 1) x \n{\V\\ x \V\)]] 
= ln|:^| + (|y|-l)xln(|y|!x |F|) 
< |F| xln(|y|! X |F|) . 

This completes the proof. □ 
We are interested in the class of problems for which sampling uniformly 
at random is easy, and cyclic permutations is one example of these. The 
above result shows that computing the partition function is hard even if we 
restrict the problem to this class. Essentially, it transfers the general NP- 
hardness result of computing the partition function to the restricted class 
of problems that we are interested in. In the following section, we show 
how to approximate the partition function given that there exist efficient 
algorithms for uniform sampling. 



6.3 Approximating the Partition Function Using 
Uniform Samplers 

As a first step towards approximating the partition function, let us consider 
using concentration inequalities. If we can sample uniformly at random from 
y, then we can apply HoefFding's inequality to bound the deviation of the 
partition function Z{w \ x) from its finite sample expectation Z(w \ x). Let 
S denote the sample size. Then 

^ \y\ 

Z{w I x) = ^ — [exp{{(p{x,yi),w))] . 

i=l 

Unfortunately, the bound obtained from using Hoeffding's inequality is not 
useful due to its dependence on the size of the output space |3^|. We now 
present an algorithm that is a fully-polynomial randomised approximation 
scheme for computing the partition function. 

Definition 6.2 Suppose f : P ^ M"*" is a function that maps problem in- 
stances P to positive real numbers. A randomised approximation scheme 

for P is a randomised algorithm that takes as input an instance p £ P and 
an error parameter e > 0, and produces as output a number Q such that 

Pr[(l-e)/(p)<Q<(l + e)/(p)]>^ . 

A randomised approximation scheme is said to be fully polynomial (FPRAS) 
if it runs in time polynomial in the length of p and 1/e. 
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We exploit the intimat e connection between counting and sampling prob- 
lems ( Jerrum et al. . 19861 ) to approximately compute the partition function 



using sampling algorithms. The technique is based on a reduction froin 
counting to sampling. The standard approach (jJerrum and Sinclaiil . [l99fil l 



is to express the quantity of interest, i.e., the partition function Z{w \ x), 
as a telescoping product of ratios of parameterised variants of the partition 
function. Let = /3o < /3i • • • < /3/ = 1 denote a sequence of parameters also 
called as cooling schedule and express Z{w | x) as a telescoping product 

Z{w\x) ^ ZiPj.^w b) ^ . . . ^ ZiPiw I X) ^ ^^^^^ ^ 



Z{fii_iw I x) Z{Pi_2W I x) Z{Pow 

Define the random variable fi{y) = exp[(/3j_i — y), w)] (we omit the 

dependence on x to keep the notation clear), for all is p], where y is chosen 
according to the distribution tt^^ = p{y \ x,l3iw). We then have 

T? f \^ \(Pi \ \i exp[A {(t){x,y),w)] 

f-'^ft ^ Z[l3iW I x) 

_ Z{Pi_iw I x) 
Z{/3iW I x) ' 

which means that fi{y) is an unbiased estimator for the ratio 

_ Z{l3i_iw I x) 
^* Z{PiW I x) 

This ratio can now be estimated by sampling using a Markov chain ac- 
cording to the distribution tt^- and computing the sample mean of /j. The 
desideratum is an upper bound on the variance of this estimator. Having 
a low variance implies a small number of samples S suffices to approximate 
each ratio well. The final estimator is then the product of the reciprocal of 
the individual ratios in the telescoping product. 

We now proceed with the derivation of an upper bound on the vari- 
ance of the random variable /», or more precisely on the quantity Bi = 
Var/j/(E /j)^. We first assume that Z{j3QW \ x) = \y\ can be computed in 
polynomial time. This assumption is true for all combinatorial structures 
consider in this chapter. If it is not possible to compute |3^| in polynomial 
time, then we can approximate it using the sam e machinery described in this 
section. We use the following cooling schedule ( Stefankovic et al. . 20091 ): 



l=p\R\\w\W- /3,. = -^, VjgP-11, 

where p is a constant integer > 3, i.e., we let the cooling schedule to be of 
the following form: 

12 3 p[R\\w\\\ 
q q q q 
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where q = pR\\w\\ (w.l.o.g. we assume that is non- integer) . Given 

this cooHng schedule, observe that exp(— < fi < exp(l/p), which fol- 
lows from the definition of the random variable fi, and also that 

exp(-l/p) <Efi = pi< exp(l/p) . 

We are now ready to prove the bound on the quantity Bi. 



roposition 6.3 Bi = -^y^ < exp(2/p), V i G 



Pre 

We first need to prove the following lemma. 
Lemma 6.4 exp{l/p) — I < Pi < exp(— + 1. 

Proof a < 6 =^ exp(a) — exp(— a) < exp(5) — exp(— 5) as the exponen- 
tial function is monotone increasing. Thus a < 1/3 =^ exp(a) — exp(— a) < 
exp(l/3) — exp(— 1/3) < 1. Setting a = 1/p with p > 3 and using the fact 
that exp(— 1/p) < Pi < exp{l/p) for all i S |/] proves the lemma. □ 

Proof (of Proposition 16. 3p Consider pi > exp(l/p) — 1 > /i — 1. This im- 
plies fi — Pi < I- Next, consider pi < exp(— 1/p) + 1 < /i + 1. This im- 
plies fi — Pi > —1. Combining these, we get \fi — Pi\ < 1, which implies 
Var/j < 1, and therefore Var/i/(E/i)2 < exp(2/p). □ 

Equipped with this bound, we are ready to design an FPRAS for ap- 
proximating the partition function. We need to specify the sample size S in 
each of the Markov chain simulations needed to compute the ratios. 



Theorem 6.5 Suppose the sample size S = [65e-2Zexp(2/p)] and suppose 
it is possible to sample exactly according to the distributions n^., for all 
i £ m, with polynomially bounded time. Then, there exists an FPRAS with 
e as the error parameter for computing the partition function. 



Proof The proof uses standard techniques described in (|Jerrum and Sinclair 



199^). Let be a sequence of S independent copies of the ran- 



dom variable fi obtained by sampling from the distribution tt^-, and let 
Xi = Yfj=i^\^'' be the sample mean. We have EXj = E/j = p^, and 
VarXj = 5'~"'^Var/j. The final estimator p = Z{w \ x)~'^ is the random 
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variable X = ni=i -^i with EX = nj=i Pi — P- Now, consider 
VarX _ VarXiXa - X, 

(EXy ~ (EXiEX2---EX;)2 

E(Xi'X2' • • • Xi^) - [E(XiX2 • • • Xi)]^ 



(EXiBXi-'-BXi)^ 

A. I rj yv2 * * * xIj 

(EXiEX2---EX02 ~ 



< 



< 




exp 



5 y 

< eV64 , 

where the last inequality follows by choosing 5 = [65e"^/ exp(2/p)] and 
using the inequality exp(a/65) < 1 + o/64 for < a < 1. By applying 
Chebyshev's inequality to X, we get 

^ r/i IN / , .N 1 16 VarX 1 
Pr[(|X-H)>(6/4)p]<^^^<-, 

and therefore, with probability at least 3/4, we have 

(i-i)p<x<(i + i)p. 

Thus, with probability at least 3/4, the partition function Z{w \ x) = X~^ 
lies within the ratio (1 it e/4) of p~^. Polynomial run time immediately 
follows from the assumption that we can sample exactly according to the 
distributions vr^. in polynomial time. □ 
We have shown how to approximate the partition function under the 
assumption that there exists an exact sampler for y from the distribution 
vr^. for all i G |/]. A similar result can be derived by relaxing the exact 
sampling assumption and is proved in Appendix lC.il In fact, it suffices to 
have only an exact uniform sampler. As we will see in Section 16. 6| it is 
possible to obtain exact samples from distributions of interest other than 
uniform if there exists an exact uniform sampler. 
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6.4 Approximating the Partition Function Using 
Counting Formulae 

In this section, we describe how to approximate the partition function using 
counting formulae (cf. Section 14. 3p . which obviates the need to use the 
sophisticated machinery described in the previous section. However, the 
downside of this technique is that it is not an FPRAS and works only for a 
specific feature representation. 

Consider output spaces y with a finite dimensional embedding ip : y ^ 
{0, +1}'^, and an input space X C M". Define the scoring function f{x, y) = 
{w, (j){x,y)) , where (p{x,y) = i^{y) x. Suppose that the size of the output 
space 13^1, the vector * = Yl^y^y 4>{y), and the matrix C = Yly(^y '^{u)'^'^ {u) 
can be computed efficiently in polynomial time. We first observe that given 
these quantities, it is possible to compute "^y^y f{x,y) and J2yey f'^i^^v) 
(as in the linear model described in Section [5.3|) . and these are given as 

'^f{x,y) = {w,"^ 0x) , 

y&y 

/^(x, y) = {C ® xx~^)w . 

y&y 

We then consider the second order Taylor expansion of the exp function at 
0, i.e., texp(a) = 1 + a + exp(a) and write the partition function as 

Z{w I x) = ^exp[/(x,y)] 

y&y 

^Y,'^ + f{x,y) + f\x,y) 

= \y\ + {w,'^ (g) x) + w^{C xx^)w . 

At first glance, one may argue that using the second order Taylor ex- 
pansion is a crude way to approximate the partition function. But observe 
that in a restricted domain around [—1,1], the second order Taylor expan- 
sion approximates the exp function very well as illustrated in Figure 16.11 
In order to exploit this property, we have to constrain the scoring function 
f{x,y) to lie in the range [—1, 1]. But from Proposition 16. H we know that 
\\w\\ < Y^ln |3^|/A. A direct application of Cauchy-Schwarz's inequality gives 
us the following bound on the scoring function: \f{x,y)\ < R^ln\y\/X. An 
immediate consequence is a bound that relates the regularisation parameter 
and the scoring function. 

Proposition 6.6 A > i?^ In \y\ \f{x, y)\<l. 
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Figure 6.1: Exponential function and its second Taylor approximation. 



6.5 Approximating the Gradient of the Log Par- 
tition Function 



The optimisation problem ()6.ip is typically solved using gradient descent 
methods which involves gradient-vector multiplications. We now describe 
how to approximate the gradient-vector multiplication with provable guar- 
antees using concentration inequalities. Let 2 be a vector in M" (where n 
is the dimension of the feature space with bounded £2 norm, i.e., 

\\z\\ < G, where G is a constant. The gradient-vector multiplication is given 
as 

{V^lnZ{w\x),z) = E [{^{x,y),z)] . 

y~p{y\x,w) 

We use Hoeffding's inequality to bound the deviation of (V^lnZ(w | x),z) 
from its estimate {d{w \ x),z) on a finite sample of size S, where 

1 ^ 

d{w I x) = - , 

i=i 

and the sample is drawn according to p{y \ x,w). 

Note that by Cauchy-Schwarz's inequality, we have | {4>{x, yi), z) \ < RG, 
for all i G [5]. Applying Hoeffding's inequality, we then obtain the following 
exponential tail bound: 

Pr(| (V^ In Z{w I x) — d{w | x), z) | > e) < 2 exp 



( — ) 
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For online optim isation methods like stochastic gradient descent and ad 



2007; Shalev-Shwartz et al.. 2007; Ratliffetal 



vance ments thereof (iHazan et al. 
20071 ). the optimisation problem is solved using plain gradient descent, and 



therefore it might be desirable to approximate the gradient and analyse the 
effects of the approximati on guarantee on fa ctors such as rates of conver- 

I2OO7I ). In Appendix IC.21 we show 



gence and regret bounds (jRatliff et al 



how to approximate the gradient of the log partition function using the ma- 
chinery described in Section \6.3\ i.e., using the reduction from counting to 
sampling. 



6.6 Sampling Techniques 

We now focus on designing sampling algorithms. These algorithms are 
needed (i) to compute the partition function using the machinery described 
in Section [631 and (ii) to do inference, i.e., predict structures, using the 
learned model by solving the optimisation problem argmaXy^y p{y \ x,w) 
for any x € X. Sam pling algorithms can b e use d for optimisation using the 
Metropolis process ( Jerrum and Sinclair . 19961 ) and p ossibly other meth- 
ods l ike simulated annealing for convex optimisation (jKalai and Vempalal . 
20061 ). Note that these methods come with provable guarantees and are not 
heuristics. 



6.6.1 Basics of Markov Chains 

We start with some pr eliminaries on Markov chai n s. The expositi o n mostly 
follow s the articles by Jerrum and Sinclair ( 1996 '); Jerrum ( 19981 ); Randalll 
(|2003l ). and the lecture notes of lVieodal (|Fall 2006l l. 

Let denote the state space of a Markov chain SOT with transition prob- 
ability matrix P : Q x Q ^ [0,1]- Let P^{u,-) denote the distribution of 
the state space at time t given that u G 1^ is the initial state. Let vr denote 
the stationary distribution of A Markov chain is said to be ergodic if 
the probability distribution over 0, converges asymptotically to vr, regard- 
less of the intial state. A Markov chain is said to be (a) irreducible if for 
all u,v S Q, there exists a t such that P*{u,v) > 0, and (b) aperiodic if 
gcd{P(u, n) > 0} = 1, for all n € il. Any finite, irreducible, aperiodic 
Markov chain is ergodic. A Markov chain is said to be time-reversible with 
respect to tt if it satisfies the following condition also known as detailed 
balanced equations: 

'7t{u)P{u,v) = 'ir{v)P{v,u), \/u,v G 17 . 

The mixing time of a Markov chain is a measure of the time taken by the 
chain to converge to its stationary distribution. It is measured by the total 
variation distance between the distribution at time t and the stationary 
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distribution. The total variation distance at time t is 

||P*,7r||to =max^ -7r(u)| . 

For any e > 0, the mixing time T(e) is given by 

T(e) = min{t : ||P*',7r||to < e, V t' > t} . 

A Markov chain is said to be rapidly mixing if the mixing time is bounded 
by a polynomial in the input and lne~^. We now describe techniques to 
bound the mixing time of Markov chains. 

Canonical Paths 

Let 9Jt be an ergodic, time-reversible Markov chain with state space 0,, tran- 
sition probabilties P{-, ■) and stationary distribution vr. Given that 9Jt satis- 
fies the detailed balanced condition, we may view 9Jt as an undirected graph 
with vertex set Q and edge set E = {{u,v} G J7 x J7 : Q(u,v) > 0}, 
where Q{u,v) = tt{u)P{u,v) = Tr{v)P{v,u). 

For every ordered pair (u, v) E x $7, a canonical path from u to w in 
the graph corresponds to a sequence of legal transitions in 9Jl that leads from 
the initial state u to the final state v. Let F = {7™ : u,v G Q} he the set 
of all canonical paths. In order to obtain good bounds on the mixing time, 
it is important to choose a set of paths F that avoids the creation of "hot 
spots:" edges that carry a heavy burden of canonical paths. The degree to 
which an even loading has been achieved is measured by a quantity known 
as congestion, which is defined as follows: 



p(F)=max-^ V 7r(M)7r(v)|7^ 
^ Q{e) ^ 



e 



where the maximum is over oriented edges e of (17, E) and \juv\ denotes the 
length of the path juv Intuitively, we expect a Markov chain to be rapidly 
mixing if it admits a choice of paths F for which the congestion p( T) is not 



too la rge. This intuition is formalised in the following result due to [Sinclair 



(|1992). 



Theorem 6.7 LSindaAlwai ) Let 9Jt be an ergodic, time-reversible Markov 



chain with stationary distribution vr and self-loop probabilities P{v,v) > 1/2 
for all states v G Let T be a set of canonical paths with maximum edge 
loading p = p{T) . Then the mixing time of 5!Jt satisfies 



for any choice of initial state u G O. 
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Coupling 



A coupling is a Markov process (Xt, Yt)'^Q on Q x Q sucli tliat eacli of the 
processes Pt and Qt, considered in isolation, is a faithful copy of Tl, i.e., 

Pr(Xt+i =x' \Xt=xAYt = y) = P{x, x') 
FiiYt+i =y' \Xt = xAYt = y) = y') , 



and also 



Xt = Yt^ Xt 



+1 



Yt 



t+i 



If it can be arranged that the processes [Xt) and (It) coalesce rapidly in- 
dependent of the initial states Xq and Yq, we may deduce that the Markov 
chain 9Jt is rapi dly mix i ng. T he key result is what is called the "Coupling 
Lemma" due to Aldous (1 19831 ). 



Lemma 6.8 l Aldout . 198^ .) (Coupling lemma) Suppose VJl is a count- 
able, ergodic Markov chain with transition probabilities P{-, •) and let {Xt, Yt 
be a coupling. Suppose t : (0, 1] — t- N is a function such that Pr(Xt(e) 7^ 
^t(e)) < e, for all e £ (0, 1], uniformly over the choice of initial state {Pq, Qq). 
Then the mixing time r(e) of M is bounded from above by t{e). 



Path Coupling 

Although coupling is a powerful technique to bound the mixing time, it may 
not be easy to measure the expected change in distance be tween two arbi- 



trary configurations. The idea of path coupling introduced bv lBublev and Dver 



(jl997l ) is to consider only a small set of pairs of configurations U Q ^ x ^ 
that are close to each other w.r.t. a distance metric (5 : Q x — >■ M. Suppose 
we are able to measure the expected change in distance for all pairs in U. 
Now, for any pair P,Q £ ^l, we define a shortest path P = ro, ri, . . . , = Q 
of length s from P to Q (sequence of transitions of minimal weight from 
P to Q), where (n,n+i) G f7 for all < / < s. If we define U appro- 
priately, then 5{P,Q) = X]f=o '5(n, r/+i), and by linearity of expectation, 
the expected change in distance between the pair (P, Q) is just the sum of 
the expected chang e between the pai r s (zi, zi+-\). We now state the "path 



coupling lemma" of Bublev and Dver ( 19971 ) 



Lemma 6.9 (Bublev and Dver . 199'\) (Path Coupling lemma) Let 5 be 



an integer valued metric defined on Q x Q, which takes values in {BJ. Let 
U be a subset of Q x n such that for all {Pt,Qt) € il. x Q there exists a path 
Pt = rQ,ri, . . . ,rs = Qt between Pt and Qt where (r/, r/_i) G U for < I < s 
"''"■d Y^t=o ^i^i^'^i+i) = ^iPt,Qt)- Suppose a coupling {P,Q) ^ {P',Q') of 
the Markov chain 9Jt is defined on all pairs of (P, Q) £ U such that there 
exists a (3 < 1 such that B[6{P',Q')] < (3B[6{P,Q)] for all {P,Q) G U. 



6.6. Sampling Techniques 
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Then the mixing time T(e) is bounded from above as follows: 

re < 



1-/3 



Coupling from the Past 

Coupling from the past (CFTP) (jProDP and WilsoiJ . Il996l : iHubeij . Il998l ) is 



a technique used to obtain an exact sample from the stationary distribution 
of a Markov chain. The idea is to simulate Markov chains forward from 
times in the past, starting in all pos sible states, as a co u pling process. If all 
the chains coalesce at time 0, then iPropp and Wilson! (|l996l ') showed that 



the current sample has the stationary distribution. 

Suppose 9JT is an ergodic (irreducible, aperiodic) Markov chain with (fi- 
nite) state space 0, transition probabilties P{-,-) and stationary distribution 
IT. Suppose J-" is a probability distribution on functions / : — t- with the 
property that for every u G il., its image f{u) is distributed according to the 
transition probability of SOT from state n, i.e., 

Pr(/(n) =v)= Piu,v), yu,v £n . 

To sample f G J-', we perform the following steps: (i) sample, indepen- 
dently for each u £ a, state Vu from the distribution P{v, •), and (ii) let 
f : Q ^ Q he the function mapping from u to Vu for all u E il. Now, let 
fs, ■ ■ ■ , ft-1 : — )• with s < t be an indexed sequence of functions, and 
denote by -F* : — t- $7 the iterated function composition 

Ps = ft-i ° ft-2 o • • • o /s+i o fs . 

A t-step simulation of 9K can be performed as follows starting from 
some initial state uq £ Q: (i) select /o, . . . , ft-i independently from J^, (ii) 
compute the composition Fq = ft-i o ft-2 ° • • • ° fi ° fo, and (iii) return 
Fq{uo). Of course, this is very inefficient requiring about \d\ times the work 
of a direct simulation. However, this view will be convenient to explain the 
conceptual ideas behind CFTP as given below. 

For fixed transition probabilities P(-, •), there is a considerable flexibility 
in the choice of distributions allowing us to encode uniform couplings over 
the entire state space. The Coupling Lemma can be stated in this setting. 
Suppose /o, • • • ,/t-i are sampled independently from T. If there exists a 
function t : (0, 1] — N such that 

Pt[Fq^^\-) is not a constant function] < e , 

then the mixing time r(e) of Tl is bounded by t(e). Thus, in principle, it 
is possible to estimate the mixing time of 9Jt by observing the coalesence 
time of the coupling deflned by J-, and thereby obtain samples from an 
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approximation to the stationary distribution of by simulating the Markov 
chain for a number of steps comparable with the empirically observed mixing 
time. However, in practice, the explicit evaluation of Fq is computationally 
infeasible. 

The first idea underlying Propp and Wilson's proposal was to work with 
instead of Fq, i.e., by "coupling from the past", it is possible to obtain 
samples from the exact stationary distribution. 

Theorem 6.10 Suppose that /_2, • • • is a sequence of independent sam- 
ples from T . Let the stopping time T be defined as the smallest number t 
for which F^^{-) is a constant function, and assume that ET < oo. Denote 
by F^oo unique value of F'^j, (which is defined with probability 1). Then 
F^oo distributed according to the stationary distribution o/9Jt. 

The second idea underlying Propp and Wilson's proposal is that in 
certain cases, specifically when the coupling is "monotone", it is pos- 
sible to evaluate F^rp without explicity computing the function composition 
/i o /2 o • • • o f-T+i ° f-T- Suppose that the state space Q is partially ordered 
>z with maximal and minimal elements T and _L respectively. A coupling 
is monotone if it satisfies the following property: 

M ^ f =^ f{u) >: f{v), Vti, V £ . 

If is montone, then F^^{T) = F^^{1.) implies F^^ is a constant function 
and that i^^((T) = F^j(_L) = F^. Therefore, it suffices to track the two 
trajectories starting at T and _L instead of tracking \Q\ trajectories. Since 
only an upper bound on T is needed for computing F^, a doubling scheme 
t = 1,2, 4, 8, 16, . . . is typically used rather than iteratively computing F^^ 
for t = 0,1,2,3,4,.... 



6.6.2 A Meta Markov chain 

The main contribution of this section is a generic, 'meta' approach that can 
be used to sample structures from distributions of interest given that there 
exists an exact uniform sampler. We start with the design of a Markov chain 
based on the Metropolis process (jMetroDolis et al.l . Iw^i ) to sample accord- 



ing to exponential family distributions p{y \ x, w) under the assumption that 
there exists an exact uniform sampler for y. Consider the following chain 
MetA: If the current state is y, then 

1. select the next state z uniformly at random, and 

2. move to z with probability min[l,p(z | x,w)/p{y \ x,w)]. 



6.6. Sampling Techniques 
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Exact Sampling Using Coupling from the Past 

As described in the previous section, CFTP is a teclinique to obtain an 
exact sample from the stationary distribution of a Markov chain. To apply 
CFTP for Meta, we need to bound the expected number of steps T until 
all Markov chains are in the same state. For the chain Meta, this occurs as 
soon as we update all the states, i.e., if we run all the parallel chains with 
the same random bits, once they are in the same state, they will remain 
coalesced. This happens as soon as they all accept an update (to the same 
state z) in the same step. First observe that, using Cauchy-Schwarz and 
triangle inequalities, we have 

Vy, y' e 3^ : I {cP{x, y) - cP{x, y'),w) \<2BR. 

The probability of an update is given by 

v{y I x,w) 



mm 

y,y' 



p{y' \ x,w)_ 



> exp{-2BR) 



We then have 



ET <1 X exp[-2BR]+ 

2 X (1 - exp[-2BR]) x exp[-2BR] + 

3 X (1 - exp[-2BR\)'^ x exp[-2BR] + ■■■ 

By using the identity X^So ~ a^^/{a^^ — l)'^ with a = {l—exp[— 2 BR]), 
we get ET = exp[2BR]. We now state the main result of this section. 

Theorem 6.11 The Markov chain Meta can be used to obtain an exact 
sample according to the distribution vr = p{y \ x, w) with expected running 
time that satisfies ET < exp{2BR). 

Note that the running time of this algorithm is random. To ensure that the 
algorithm terminates with a probability at l east (1 — 5), it is required to run 
it for an additional factor of ln(l/5) time ( Huber . 19981 ). In this way, we 



can use this algorithm in conjunction with the approximation algorithm for 
computing the partition function resulting in an FPRAS. The implication 
of this result is that we only need to have an exact uniform sampler in order 
to obtain exact samples from other distributions of interest. 

We conclude this section with a few remarks on the bound in Theo- 
rem lG.lll and its practical implications. At first glance, we may question the 
usefulness of this bound because the constants B and R appear in the ex- 
ponent. But note that we can always set = 1 by normalising the features. 
Also, the bound on B (cf. P r oposi tion 16. ip could be loose in practice as 
observed recently by Do et al.] ( 20091 ). and thus the value of B could be way 



below its upper bound y^ln |3^|/A. We could then employ techniques similar 
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to those described by Do et al. ( 20091 ) to design optimisation strategies that 



work well in practice. Also, note that the problem is mitigated to a large 
extent by setting A > In |3^| and R = 1. 

While in this section we focused on designing a 'meta' approach for sam- 
pling, we would like to emphasise that it is possible to derive improved 
mixing time bounds by considering each combinatorial structure individu- 
ally. As an example, we design a Markov chain to sample from the vertices 
of a hypercube and analyse its mixing time using path coupling. Details are 
delegated to Appendix IC. 31 

Mixing Time Analysis using Coupling 

We now analyse the Markov chain Meta using coupling and the coupling 
lemma (cf. Lemma l6.8p . 

Theorem 6.12 The mixing time of Meta with an exact uniform sampler 
is bounded from above as follows: 

\{lne-^)/ln{l-exp{-2BR))-^] . 

Proof Using Cauchy-Schwarz and triangle inequalities, we have 

Vy,y' G y : - Hx,y'),w)\ < 2BR . 

The probability of an update is 

> exp{-2BR) . 



mm 

y,y' 



^ p{y\x,w) 
' p{y'\x,w) 



The probability of not updating for T steps is therefore less than (1 
exp{-2BR)f. Let 

t{e) = r(lne-^)/ln(l -exp(-2Bi?))-i] . 

We now only need to show that Pr(Pi(^) ^ Qt(e)) = Consider 

Pr(Pt(.) / Qtie)) 

< (1 - exp(-2Si?))('"^)/^'^(^-'=^P(-2^^)) 

= exp[ln(l - exp{-2BR) + e - 1 + exp{-2BR))] 



e . 



The bound follows immediately from the Coupling Lemma. □ 



6. 7. Summary 
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6.7 Summary 



The primary focus of this chapter was to rigorously analyse probabilistic 
structured prediction models using tools from MCMC theory. We designed 
algorithms for approximating the partition function and the gradient of the 
log partition function with provable guarantees. We also presented a simple 
Markov chain based on Metropolis process that can be used to sample ac- 
cording to exponential family distributions given that there exists an exact 
uniform sampler. 

If we were to solve the optimisation problem (j6.ip using iterative tech- 
niques like gradient descent, then we have to run Markov chain simulations 
for every training example in order to compute gradients in any iteration of 
the optimisation r outine. We therefor e argue for using online conve x optimi- 



sation techniques (|Hazan et al.l . 120071 : IShalev-Shwartz et al.l . 120071 ) as these 



would result in fast, scalable algorithms for structured prediction. 



Chapter 7 

Conclusions 



State-of-the-art approaches for structured prediction hke structured SVMs 



(jTsochantaridis et al.l . |2005| ) have the flavour of an algorithmic template in 
the sense that if, for a given output structure, a certain assumption holds, 
then it is possible to train the learning model and perform inference effi- 
ciently. The standard assumption is that the argmax problem 

y = argmax /(x, z) 

is tractable for the output set y. Thus, all that is required to learn and 
predict a new structure is to design an algorithm to solve the argmax prob- 
lem efficiently in polynomial time. The primary focus of this thesis was 
on predicting combinatorial structures such as cycles, partially ordered sets, 
permutations, and several other graph classes. \ ye studied the limitations of 
exist i ng structured predi ction models (jCollinsl . I2OO2I : iTsochantaridis et al 



exist i ng structured predi ction models ([UoJiing, [zuyj; I i socnantaridis et ai.l . 
200,4 iTaskar et all . boO,^ ) for predicting these structures. The limitations 



are mostly due to the argmax problem being intractable. In order to over- 
come these limitations, we introduced new assumptions resulting in two 
novel algorithmic templates for structured prediction. 

Our first assumption is based on counting combinatorial structures. We 
proposed a kernel method that can be trained efficiently in polynomial time 
and also designed approximation algorithms to predict combinatorial struc- 
tures. The resulting algorithmic template is a generalisation of the classical 
regularised least squares regression for structured prediction. It can be used 
to solve several learning problems including multi-label classification, ordi- 
nal regression, hierarchical classification, and label ranking in addition to 
the aforementioned combinatorial structures. 

Our second assumption is based on sampling combinatorial structures. 
Based on this assumption, we designed approximation algorithms with prov- 
able guarantees to compute the partition function of probabilistic models for 
structured prediciton where the class conditional distribution is modeled us- 
ing exponential families. Our approach uses classical results from Markov 
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chain Monte Carlo theory ( Jerrum and Sinclair . 19961 : Randall . 20031 ). It 
can be used to solve several learning problems including but not limited 
to multi-label classification, label ranking, and multi-category hierarchical 
classification. 

We now point to some directions for future research. In Section 15.41 
we designed approximation algorithms for solving the argmax problem for 
combinatorial output sets, but could do so with only weak approximation 
guarantees. While the approximation factor has no effect in training models 
such as structured ridge regression (cf. Chapter [5]), we believe there is a need 
to further investigate the possibilities of designing approximation algorithms 
with improved guarantees. We describe two promising directions for future 
work below. 



Approximate Inference using Simulated Annealing 

If we restrict ourselves to linear models, then the argmax problem reduces 
to a linear programming problem: 

y = argmax {w, (j){x, y)) . 

y&y 



Recently. iKalai anc 



(jKirkpatrick et al 



Vempalal (|200fil l considered using simulated annealing 



198 



^) to minimise a linear function over an arbitrary 



convex set. More precisely, they considered the following linear minimisation 
problem: for a unit vector c G M" and a convex set K C M": 

min (c, z) , 

under the assumption that there exists a membership oracle for the set 
K. The algorithm, described below, goes through a series of decreasing 
temperatures as in simulated annealing, an d at each temperature, it u ses 
a random walk based sampling algorithm ( Lovasz and Vempala . 20061 ) to 



sample a point from the stationary distribution whose density is proportional 
to exp(— (c, z) /T), where T is the current temperature. Let R be an an 
upper bound on the radius of a ball that contains the convex set and r 
be a lower bound on the radius of a ball around the starting point contained 
in K. At every temperature (indexed by i), the algorithm performs the 
following steps: 

1. Set the temperature Tj = R{1 — Yj^fnf . 

2. Move the current point to a sample from a distribution whose density 
is proportional to exp(— (c, 2:) /Tj), obtained by executi ng k steps of a 
biased random walk (refer ( Lovasz and Vempala . 20061 ) for details). 



Using 0*(n)^ samples observed during the above walk and estimate 
the covariance matrix of the above distribution. 



^The O* notation hides logarithmic factors. 
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Theorem 7.1 i Kalai and Vempalci . 200d) For any convex set K G M", with 



probability 1 — 5, the above procedure given a membership oracle for the 
convex set K, starting point zq, R, r, I = 0{y/n\og{Rn/e5)), k = 0*{n'^), 
and N = 0*{n), outputs a point zj G K such that 

(c, zj) < min (c, z) + e . 

zdK 

The total number of calls to the membership oracle is 0*{n'^'^). 

While the combinatorial structures considered in this work do not form 
a convex set, it would be interesting to consider the convex hull of such sets, 
for instance, the convex hull of the vertices of a hypercube and derive results 
similar in spirit to Theorem 17.11 For sampling, we can use the algorithms 
described in Section [6.61 to sample from exponential family distributions. 

Approximate Inference using Metropolis Process 

The Metropolis process (iMetropohs et al.l . 1 19531 ) can be seen as a special 



case of simulated annealing with fixed temperature, i.e., the Markov chain 
on the state space is designed to be time-homogeneous. This chain can 
be used to maximise (or mi nimise) objective functions / : 17 — )• R defined 
on the combinatorial set Q, ( Jerrum and Sinclair . 19961 ). Consider the fol- 



lowing chain MC(q) which is similar to the 'meta' Markov chain described 
in Section [6.61 If the current state is y, then 

1. select the next state z uniformly at random, and 

2. move to z with probability min(l, a-^*^^)"''^^^)), 

where a > 1 is a fixed parameter. The stationary distribution of this chain 
is 

7ra(-) = -WT—r , 

Z{a) 

where Z[a) is the partition function. 

Note that tTq, is a monotonically increasing function of /(•) as desired. 
It is now crucial to choose the parameter a appropriately. Having a small 
a would make the distribution tt^ well concentrated whereas having a large 
TTa would make the chain less mobile and susceptible to locally optimal 
solutions. Note than we can analyse the mixing time of this chain using the 
techniques described in Section 16.61 The probability of finding the optimal 
solution is at least the weight of such solutions in the stationary distribution 
TTa- Therefore, if we can derive non-trivial upper bounds on the weights of 
optimal solutions, then we can be certain of hitting such solutions using 
the chain with high probability. The success probability can be boosted by 
running the chain multiple times. Choosing an appropriate a and deriving 
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non-trivial upper bounds on the the weights of optimal solutions is an open 
question. Note that if we set a = exp(l), we are essentially sampling from 
exponential family distributions. 



Appendix A 

Kernels and Low-dimensional 
Mappings 



The J ohnson-Lindenstrauss lemma (|Johnson and Lindenstraussl . ll984l : lDasgupta and Guptal . 



states that any set of m points in high dimensional Euclidean space 



can be mapped to an n = 0(lnm/e^) Euclidean space such that the dis- 
tance between any pair of points are preserved within a (1 it e) factor, for 
any < e < 1. Such an embedding is constructed by projecting the ni points 
onto a random fc-dim e nsion al hy per sphere. A s imila r result was proved by 
Arriaga and Vempala ( 19991 ) and Balcan et alJ ( 20061 ) in the context of ker- 



nels. Given a kernel matrix K in the (possibly) infinite dimensional (p-space, 
it is possible to compute a low-dimensional mapping of the data points that 
approximately preserves linear separability. Specifically, if the data points 
are separated with a margin 7 in the (p-space, then a random linear projec- 
tion of this space down to a space of dimension n = 0(;^ln^) will have a 



linear separator of error at most e with probability at least 1 — S. 

One of the drawbacks of using this random projection method for low- 
dimensional mapping of kernels is the need to store a dense random matrix 
of size nxm which is prohibitively expensive for large-scale applications. To 
circumvent this problem, one could resort to Nystrom approximation of the 
eigenvectors and eigenvalues of the kernel matrix using random sampling. 
This technique is a fast, scalabl e version of the classical multi-dimensional 
scaling (MDS) (see ffl, » and references therein), and is referred to 



as NMDS in the following. 

Let be a positive definite matrix. Choose k <^ m samples (or data 
points) at random, re-order the matrix such that these samples appear at 
the beginning, and partition it as 

^ = {b^ C 

Let A = UAU^ , n be the dimension of the embedding, [/[„] and Ar„i be 
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the submatrices corresponding to the n largest eigenvectors and eigenvalues 
respectively. Then the low-dimensional embedding (matrix) is 

The computational complexity of this approach is 0{nkm + k^). 

The NMDS approach cannot be applied if the input matrix is indefi- 
nite. For example, in learning on graphs, the input is a sparse adjacency 
matrix but not a kernel. In such cases, we would still like to find a low- 
dimensional embedding of this graph without the need to compute a kernel 
as an intermediate step. Note that if we use the (normalised) Laplacian as 
the input, then the algorithm would output the leading eigen vectors, but we 



need to compute the trailing eigenvectors of the Laplacian. iFowlkes et al 



proposed a two-step approach to solve the problem even when the 



input matrix is indefinite. First, NMDS is applied to the input matrix. 
Let = [U^A-^U^B], Z = tJK^/^ and let VT.V^ denote the orthog- 
onalisation of Z. Then the matrix Z^S^^/^ contains the leading or- 
thonormalised approximate eigenvectors of K and the low-dimensional em- 
bedding is given by ZV . The computational complexity is still in the order 
of 0{nkm + k^), but with an additional 0{k^) orthogonalisation step. 
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Counting Dicyclic 
Permutations 

We represent a directed cyclic permutation by the set of (predecessor, successor)- 
pairs. For instance, the sequences {abc, bca, cab} are equivalent and we use 
{(a, 6), {b, c), (c, a)} to represent them. Furthermore, we define i^'^y-^{y) = 1 
if {u^v) G y, i.e., v follows directly after u in y; = ~1 if iy,u) G y, 

i.e., u follows directly after v in y; and otherwise. 

For a given alphabet E, we are now interested in computing ID^'^^'^I, the 
number of cyclic permutations of subsets of S. For a subset of size i there 
are i\ permutations of which i represent the same cyclic permutation. That 
is, there are (i — 1)! cyclic permutations of each subset of size i, and for an 
alphabet of size A'' = |E| there are 

N , . 

different cyclic permutations of subsets of S. 

Computing ^i^^^ is simple. Observe that, for each cyclic permutation 
containing a (predecessor, successor)-pair {u,v), there is also exactly one 
cyclic permutation containing {v,u). Hence the sum over each feature is 
zero and ^^^^ = (where is the vector of all zeros). 

It remains to compute C^^^. Each element of this matrix is computed as 

As seen above, for u = u' and v = v' there are as many cyclic permutations 
for which V'(^*^„) = +1 ^ there are cyclic permutations for which ip^^'^y-^ = 
In both cases, iPll^y-){y)tpllfy)iy) = +1, and to compute C^l'^^^^f^^^ it suffices 
to compute the number of cyclic permutations containing {u,v) or {v,u). 
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There axe (i — 2) ! different cyclic permutations of each subset of size i that 
contain {u,v). We thus have 

iV / s 

iu,v),{u,v) \ I - 2 ^ ' 

i=?> ^ ^ 

Similarly, for u = v' and v = u', we have = ~1 

{u,v),{v,u) A-^yi - 2 J ^ ' 

For u ^ u' ^ V ^ v' ^ we observe that there are as many cyclic 
permutations containing («, v) and {u! , v') as there are cyclic permutations 
containing {u, v) and {v' , u'), and hence in this case C^^^^-^ = 0. Finally, 
we need to consider C?^'^ . , ^ , C?^'^ . , , Cf^'^ . , ^ , and Cf^'^ . , , Here 
we have 

<^(T«),(«y) = ^Tu,vUu',v) = -2 E (i - 3)! and 

'^{lluv,v') = '^{ll),{u',u) = +2 E ( ^ _ 3 ) - 3)! ■ 



Appendix C 

Appendix for Chapter 



C.l Approximating the Partition Function using 
Approximate Samples 

In Section (631 we designed an FPRAS for approximating the partition func- 
tion under the assumption that there exists an exact sampler. We now con- 
sider the case where we only have approximate samples resulting from a 
truncated Markov chain. Let Tj denote the simulation length of the Markov 
chains, for all z S |/]. The main result of this section is as follows: 

Theorem C.l Suppose the sample size S = [65e-2/exp(2/p)] and suppose 
the simulation length Ti is large enough that the variation distance of the 
Markov chain from its stationary distribution vr^,. is at most e/5/ exp(2/p). 
Under the assumption that the chain is rapidly mixing, there exists an 
FPRAS with e as the error parameter for computing the partition function. 



Proo f The proof again uses techniques described in (jjerrum and Sinclair 



1996h . The bound Var/i/(E/i)2 < exp(2/p) (from Proposition [63]) w.r.t 



the random variable fi will play a central role in the proof. We cannot 
use this bound per se to prove approximation guarantees for the partition 
function Z{w \ x). This is due to the fact that the random variable fi 
is defined w.r.t. the distribution tt/j., but our samples are drawn from 
a distribution vr^, resulting from a truncated Markov chain, whose varia- 
tion distance satisfies Itt/^^ — tt/^J < e/5/ exp(2/p). Therefore, we need to 
obtain a bound on Var/j/(E /j)-^ w.r.t. the random variable fi defined 
analogously to fi with samples drawn from the distribution vr^.. An in- 
teresting observation is the fact that Lemma 16.41 still holds for pi, i.e., 
exp(l/p) — 1 < Pi < exp(— -|- 1, for all integers p > 3, and using similar 
analysis that followed Lemma 16.41 we get 

^^^^ < exp(2/p), V i G [/] . 
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Also, note that |7r^-— vr^J < e/5/exp(2/p) implies \pi — pi\ < e/5/exp(l/p) 
(using the fact that exp(— < pi < exp(l/p)). Therefore, 

il-^^)p^<P^<{l + ^^)p^ ■ (C.l) 

Equipped with these results, we are ready to compute the sample size 
S needed to obtain the desired approximation guarantee in the FPRAS. 
LetX«,...,xf) be a sequence of S independent copies of the random 
variable obtained by sampling from the distribution vr^. , and let Xi = 
S~^J2j=i-^i'''^ be the sample mean. We have EXj = E/j = pi, and 
VarXj = 5~"'^Var/j. The final estimator p = Z{w \ x)~^ is the random 
variable X = W\^^ Xi with EX = 0^=1 Pi = P- From (fUTll . we have 

{l-\)p<P<{l + \)p. (C.2) 



Now, consider 



(EX)2 \}\ +(EX,)2 



S 

[le 

< exp 



^exp(|)^ 



S 



< eV64 



if we choose S = [65e ^/exp(2/p)] (because exp(a/65) < 1 + a/64 for 
< a < 1). By applying Chebyshev's inequality to X, we get 

r.i / ,<N~n 16 VarX 1 

Pr[(|X-/,|)>(e/4)p]<-^^^<-, 

and therefore, with probability at least 3/4, we have 

(i_£)p<x<(i + |)p. 

Combining the above result with (|C.2|) , we see that with probability at least 
3/4, the partition function Z{w \ x) = X~^ lies within the ratio (1 it e/4) 
of p~^. Polynomial run time follows from the assumption that the Markov 
chain is rapidy mixing. □ 
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C.2 Approximating the Gradient of the Log Parti- 
tion Function using a Reduction from Count- 
ing to Samphng 

In this section, we show how to approximate the gradient of the log par- 
tition function using the reduction from counting to samphng described in 
Section 16. 3i Recall that the gradient of the log partition function generates 
the first order moment (mean) of (j){x,y), i.e., 

E 4>{x,y)exp{{w,(j){x,y))) 

V^lnZ(u7 I x) = — — r-- 

L exp((u;,0(x,7/))) 

y&y 

= E Wx,y)] . 

yr^p(y\x,w} 

The numerator in the above expression is the quantity of interest and it 
can be seen as a weighted variant of the partition function Z{w \ x) where 
the weights are the features (t){x,y). We will use 4>j{x,y) to denote the jth 
component of 4){x,y) and let Zj{w \ x) = '^y^y 4>j{x,y) eyi.\i{{(j){x,y),w)). 
Consider again the random variable fi{y) = exp[(/3j_i — /3j) y), w)], 
where y is now chosen according to the distribution 

^ 4>j jx, y) exp( (0(a;, y) , /Sjw)) 
lly(iy<l^j{x,y)e^v{{(t>{x,y),(5iw)) 

i.e, we use the same random variable as defined in Section 16.31 but sample 
according to a slightly different distribution as given above. It is easy to 
verify that fi{y) is an unbiased estimator for the ratios in the telescoping 
product of the quantity of interest, i.e, 

Zjifii^iw I x) 
y-TT^o Zj{^iW I x) 

where / is the length of the cooling schedule (cf . Section 16. Sp . It remains to 
analyse the mixing time of the Markov chain with stationary distribution tt^^ . 
Since features can take the value of zero. Theorems 16.111 and 16.121 cannot 
be applied. One solution to this problem would be to modify the state 
space of the Markov chain in such a way that we sample (uniformly) only 
those structures satisfying |(/)j(x,y)| > 7, where 7 is a parameter, and then 
run the Metropolis process. It would be interesting to further analyse the 
approximation that is introduced by discarding all those structures satisfying 
\(t)j{x,y)\ < 7, but we do not focus on this aspect of the problem here. 

A note on computational issues follows. At first glance, it may seem com- 
putationally inefficient to run the machinery for every feature, but note that 
it is possible to reuse the computations of individual ratios of the telescoping 
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product by designing an appropriate cooling schedule. First, consider the 
following expression: 

^ (j)j{x, y) exp((u', (/>(x, y))) 
yey 

= ^exp{{w,(l){x,y)) + ln(l)j{x,y)) 

y&y 

= ^exp(cj {w,(l){x,y))) , 

where Cj = 1 + ln(j)j{x,y)/ eyip{{w,(j){x,y))). Let c = maxj cj. The cooling 
schedule is then given as 

12 3 cplR\\w\\\ 
q q q q 

where q = cpR\\w\\. 

C.3 Mixing Time Analysis of MCcube using Path 
Coupling 

The state space O is the vertices of the d-dimensional hypercube {0, l}'^. 
Consider the following Markov chain MCcube on 17. Initialise at and repeat 
the following steps: (i) pick {i,b) G [d] x {0,1}, and (ii) move to the next 

state, formed by changing ith bit to b, with probability min (l,^y Let 

d{-,-) denote the Hamming distance. The transition probabilities of this 
chain are given by 



P{u,v) 



^min(l,^) , iid{u,v) 



1; 



|, ii u = v; 
0, otherwise . 



We now analyse the mixing time of MCcube using path coupling (cf. 
Section 16.6. ip . Recall that the first step in using path coupling is to define 
a coupling on pairs of states that are close to each other according to some 
distance metric (5 : J7 x — t- M on the state space. 

Theorem C.2 The Markov chain MCcube on the vertices of a d- dimensional 
hypercube has mixing time r(e) = 0{d\n{de-^)). 

Proof We will first prove the bound on the mixing time for uniform sta- 
tionary distribution and later generalise it to arbitrary distributions. 

We choose Hamming distance as the metric, and consider pairs of states 
{u, v) that differ by a distance of 1. To couple, we choose (i, b) G [d] x {0, 1} 
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uniformly at random, and then update to (n', v') formed by changing ith bit 
to b if possible. We now need to show that the expected change in distance 
due to this update never increases. More precisely, we need to prove that 



and invoke the path coupling lemma (cf. Lemma 16. 9p to obtain the final 
result. Let u and v differ at the j-th. bit. We have the following two cases. 

• Case 1: i ^ j- In this case, if Ui = Vi = b, then there is no update 
and therefore no change in distance. If Ui = Vi ^ b, then both u 
and V update their i-th bit and therefore, again, there is no change in 
distance. 

• Case 2: i = j. In this case, there is definitely a decrease in distance 
with probability l/d as one of ti or u (but not both) will update their 
i-th bit. 

We therefore have 



with /? < 1 as desired. Invoking the path coupling lemma gives us the 
following bound on the mixing time 



Let z be the new state formed in cases 1 and 2 above. For non-uniform 
distributions, we have the following: 



- ^[Piu,w)+P{v,w)] . 



Note that z differs from u and f by a single bit, and therefore under the 
assumptions that 



E[5{u',v')] = I36iu,v), /3<1 



mu',v')] 




fi5{u,v) 



d-l 



P{u,z) 



min 1 



' 7r(n) 
tt{z) 



1 and 



P{v,z) 



min 1 



1 , 



we have, once again. 



¥.[5{u',v')] = /35{u,v) 



with /3 < 1. 



□ 
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